Long-lived WebSocket Connections

I am developing a data capture and trading automation tool that maintains long-lived connections to both the trade and market data servers. Occasionally a connection gets dropped, without a close frame or event message, for a few reasons I’ve identified:

  • Failing to send heartbeats often enough
  • Sending different requests with an identical integer i
  • Sending multiple requests too rapidly
  • Maintaining a connection for over 24 hours

Sometimes one (or both) of the connections drops without any obvious relation to the causes above. My socket handlers check connectivity after an EOF by pinging the WebSocket server and 1.1.1.1, but network reachability is almost never the issue. At this point I’m concerned it’s a task scheduling problem that will be difficult to debug under multithreading.

Has anyone else stumbled on other avoidable causes for connection loss and/or end-of-file?

A quick note about sending heartbeats in response to the server’s heartbeats: this works as long as you aren’t streaming real-time data over the websocket. Once you start streaming real-time data, the websocket server stops sending heartbeats, so make sure you have some kind of fallback for that scenario.

Also keep in mind that setInterval and setTimeout aren’t the best choice here either, because browsers throttle those functions in inactive tabs. You could instead use a Date-based mechanism that compares the time between received messages (while still returning a heartbeat whenever you receive one). That way, even if your app is running in an inactive tab, it will continue to function as expected.
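Here is a minimal sketch of that timestamp-comparison idea, transposed to Julia to match the code later in this thread (the original advice concerns browser JavaScript, and send_frame is a hypothetical stand-in for whatever call your websocket library uses to write a text frame):

using Dates

mutable struct HeartbeatClock
    last_sent::DateTime
end

# Call this from the message handler instead of relying on a timer:
# whenever any message arrives, check how long it has been since our
# last heartbeat and top it up if the interval has elapsed.
function maybe_heartbeat!(clock::HeartbeatClock, send_frame::Function;
                          interval = Millisecond(2500))
    if now() - clock.last_sent >= interval
        send_frame("[]")    # the empty-brackets heartbeat (see reply below)
        clock.last_sent = now()
    end
end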

Agree with sending heartbeats independently of the server’s heartbeats. I’ve just been scheduling a task that runs every 2-2.5 seconds to send the empty brackets. No problems yet.
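In case it helps, a minimal sketch of that scheduled task in Julia, assuming send_frame wraps your websocket library’s write call (e.g. HTTP.WebSockets.send; the helper itself is hypothetical):

# Background task that sends the empty-brackets heartbeat on a fixed
# cadence, independent of anything the server sends. Flip running[]
# to false on shutdown to stop the task.
function start_heartbeat(send_frame::Function; interval = 2.0)
    running = Ref(true)
    task = @async while running[]
        send_frame("[]")
        sleep(interval)
    end
    return running, task
end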

The best workaround I’ve found so far for sporadic disconnects (a rough code sketch follows the list):

  • Run a separate authorizer task for both websockets. Whenever a socket receives an open frame after reconnecting, send an authorization request with the current token.
  • When generating an access token, use the expiration datetime to schedule the next token generation. This way there is always a valid token with which to authenticate.
  • Unless handling an application shutdown request or a server close frame, assume the connection was dropped in error. Reopen the connection and resubscribe to the desired market data symbols. When the new connection opens, this will trigger an automatic authentication as above.
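Here is a rough sketch of those three pieces in Julia. open_socket, authorize, subscribe, listen, and generate_access_token are hypothetical stand-ins for your own client functions, not a real API:

using Dates

# Keep a fresh token available at all times by scheduling the next
# generation from the expiry returned with the current one.
function start_token_refresher(token_ref::Ref{String})
    @async while true
        token, expires_at = generate_access_token()   # hypothetical call
        token_ref[] = token
        # Renew a minute before expiry; never sleep less than a second.
        sleep(max(1.0, (expires_at - now()).value / 1000 - 60))
    end
end

# Supervise one websocket: on any drop that isn't a deliberate shutdown
# or a server close frame, reconnect, reauthorize, and resubscribe.
function run_connection(url, token_ref, symbols, shutdown::Ref{Bool})
    while !shutdown[]
        try
            ws = open_socket(url)            # returns after the open frame
            authorize(ws, token_ref[])       # re-auth with the current token
            foreach(s -> subscribe(ws, s), symbols)
            listen(ws)                       # blocks until EOF / close frame
        catch err
            @warn "connection dropped; reconnecting" err
        end
        shutdown[] || sleep(1.0)             # brief backoff before retrying
    end
end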

On a side note, I still haven’t found a clean way to handle a connection that drops in the middle of a request. In other words: the client sends a request, the connection drops, and the client waits for the response indefinitely. I figure a solution needs both an association with the request integer i and a timeout after X seconds have passed without the corresponding response i. Checking for i on a response channel or pipe that gets popped by parallel tasks is prone to race conditions. Anyway, if I land on something elegant, I’ll definitely share it here.
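For what it’s worth, one possible shape for that in Julia (a sketch under my own assumptions, not a tested solution): register a single-use Channel under each request’s i, let the socket reader deliver responses by i, and race the take! against a Timer.

const pending = Dict{Int,Channel{Any}}()
const pending_lk = ReentrantLock()

# Requester side: register a channel under i, then wait with a timeout.
function await_response(i::Int; timeout = 5.0)
    ch = Channel{Any}(2)    # capacity 2 so neither producer can block
    lock(pending_lk) do
        pending[i] = ch
    end
    timer = Timer(_ -> put!(ch, :timeout), timeout)
    try
        return take!(ch)    # first of {response, :timeout} wins
    finally
        close(timer)
        lock(pending_lk) do
            delete!(pending, i)
        end
    end
end

# Reader side: route an incoming response to whoever registered i.
function deliver!(i::Int, response)
    lock(pending_lk) do
        ch = get(pending, i, nothing)
        ch === nothing || put!(ch, response)
    end
end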

I’m a fan of using a closure for indexing the requests! Great suggestion.

We’re way off the topic of connection loss, but anyway, here’s my crack at a thread-safe version in Julia:

mutable struct Index
    i::Int64
    lk::ReentrantLock
    Index() = new(0, ReentrantLock())
end

# Calling the object atomically increments and returns the counter.
# The do-block form of lock releases the lock even if an exception is
# thrown, with no explicit try/finally needed.
(idx::Index)() = lock(idx.lk) do
    idx.i += 1
end

EDIT: I guess my pattern is really a “callable struct” rather than a closure, but it’s the same idea. The counter value is isolated inside a callable object that anything needing a new integer can invoke. Pretty neat.

To create an indexing object called “data_i”, to be used globally: data_i = Index()

To return a new integer, in whatever task needs it: data_i()

Linking to a related topic with more advice:

Hi, do you know how to close a socket session manually? I need to test in replay mode frequently, but sometimes after running my program the socket times out. I think there is still an open socket on the TR server, so it rejects my frequent reconnection attempts.