Long-lived WebSocket Connections

I am developing a data capture and trading automation tool, which uses long-lived connections with both the trade and market data servers. Occasionally my connection gets dropped, without a close frame or event message, for a couple of reasons I’ve identified:

  • Failing to send heartbeats often enough
  • Sending different requests with an identical integer i
  • Sending multiple requests too rapidly
  • Maintaining a connection for over 24 hours

Sometimes one or both connections drop without being obviously related to the causes above. My socket handlers check connectivity after an EOF by pinging the WebSocket server and 1.1.1.1, but that is almost never the issue. At this point I’m concerned it’s a task scheduling issue that will be difficult to debug with multithreading.

Has anyone else stumbled on other avoidable causes for connection loss and/or end-of-file?

For heartbeats my code just sends one back when it receives one:

ws.on('message', async (message) => {
  try {
    //normalize Buffer payloads to a string (no-op if already a string)
    message = message.toString()
    const type = message.slice(0, 1)
    message = message.slice(1)
    //open frame
    if (type === 'o') {
      //...
    }
    //heartbeat frame
    else if (type === 'h') {
      ws.send('[]')
    }
    //array frame
    else if (type === 'a') {
      const messages = JSON.parse(message)
      messages.forEach(async (eventOrResponse) => {
        //...
      })
    }
    //close frame
    else if (type === 'c') {
      //...
    }
    else { throw new Error('unknown message type ' + type) }
  }
  catch (err) {
    ws.close()
    console.error(err)
    process.exit(1)
  }
})

For requests my code uses a closure to hand out a different id for each request:

//every request gets a unique id
function makeNextId() {
  let count = 0
  return function nextId() {
    return count++ //return the current id, then advance
  }
}
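
For instance, you’d create one generator up front and call it wherever a request needs a fresh id:

//one shared generator for the whole connection
const nextId = makeNextId()

nextId() // 0
nextId() // 1
nextId() // 2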

I haven’t run into heartbeat problems, request id problems, or too-rapid-request problems as long as I handle p-ticket and p-time. Market replay, though, always kicks me out after a minute or so, even when I handle the clock synchronization events. As for connections over 24 hours long, I haven’t done much of that, although I plan to eventually, so I hope it doesn’t become a problem. If I did get kicked off a connection, I suppose I would just reauthenticate on a new connection, although I’m not sure how long, for example, an OAuth authentication would last.

A quick note about sending heartbeats on receiving heartbeats - this works as long as you aren’t streaming real-time data over the WebSocket. Once you start streaming real-time data, the server stops sending heartbeats. Make sure you have some kind of fallback for that scenario.

Also keep in mind that relying on setInterval or setTimeout alone isn’t the best choice either, because browsers throttle these functions in inactive tabs. You could use a Date-based mechanism that compares the time between received messages (and keep returning a heartbeat on receiving one, as before). That way, even if your app is running in an inactive tab, it will continue to function as expected.

Agree with sending heartbeats independently from the server’s heartbeats. I’ve just been scheduling a task that runs every 2-2.5 seconds to send the empty brackets. No problems yet.

The best workaround I’ve found so far for sporadic disconnects (a rough sketch follows the list):

  • Run a separate authorizer task for both websockets. Whenever a socket receives an open frame after reconnecting, send an authorization request with the current token.
  • When generating an access token, use the expiration datetime to schedule the next token generation. This way there is always a valid token with which to authenticate.
  • Unless handling an application shutdown request or a server close frame, assume the connection was dropped in error. Reopen the connection and resubscribe to the desired market data symbols. When the new connection opens, this will trigger an automatic authentication as above.
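
Here’s that loop as a rough sketch, under the assumption that openSocket, authorize, resubscribe, and refreshAccessToken are placeholders for whatever your client actually exposes:

//hypothetical helpers: openSocket, authorize, resubscribe, refreshAccessToken
let accessToken = null
let shuttingDown = false //set to true during an intentional shutdown

//keep a valid token on hand by scheduling renewal off the expiration
async function refreshTokenLoop() {
  const { token, expiration } = await refreshAccessToken()
  accessToken = token
  //renew a minute before this token expires
  const delay = Math.max(new Date(expiration) - Date.now() - 60 * 1000, 0)
  setTimeout(refreshTokenLoop, delay)
}

function connect(symbols) {
  const ws = openSocket()
  ws.on('message', (message) => {
    const type = message.toString().slice(0, 1)
    //open frame: authorize with the current token, then resubscribe
    if (type === 'o') {
      authorize(ws, accessToken)
      resubscribe(ws, symbols)
    }
    //...handle 'h', 'a', and 'c' frames as before
  })
  ws.on('close', () => {
    //unless we meant to close, assume the drop was an error and reconnect
    if (!shuttingDown) connect(symbols)
  })
}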

On a side note, I’ve still not created a clean way to handle dropped connections in the middle of a request. In other words, client sends request…connection drops…client waits for response indefinitely. I figure a solution needs both an association with the request integer i, and a timeout after X seconds have passed without receiving the corresponding response i. Checking for i on a response channel or pipe that gets popped by parallel tasks is prone to race conditions. Anyway, if I land on something elegant, I’ll definitely share it here.

As @Alexander suggested, I’ve altered the code to send heartbeats independently. It now tracks the time since the last WebSocket message was received and sends a heartbeat once 2.5 seconds or more have elapsed. That check runs on a setInterval every 2.5 seconds, and it still echoes a heartbeat whenever it receives one. With these changes the market replay connection no longer gets dropped after a minute, which is nice.
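
In case it helps anyone, the gist of it looks something like this (a minimal sketch; ws is the connection from the earlier snippet):

//track when any message last arrived
let lastMessageAt = Date.now()

ws.on('message', (message) => {
  lastMessageAt = Date.now()
  //still echo a heartbeat when the server sends one
  if (message.toString().slice(0, 1) === 'h') {
    ws.send('[]')
  }
  //...handle the other frame types
})

//every 2.5 seconds, send a heartbeat if the connection has gone quiet
setInterval(() => {
  if (Date.now() - lastMessageAt >= 2500) {
    ws.send('[]')
  }
}, 2500)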

@Brady If you’re using JavaScript then Promise.race might be what you’re looking for. It takes multiple promises and returns a promise that settles as soon as the first of the input promises settles.
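
For example, you could race the pending response against a timer (waitForResponse(i) here is a hypothetical stand-in for however you resolve a response by its id):

//reject if the wrapped promise doesn't settle within ms milliseconds
function withTimeout(promise, ms) {
  const timeout = new Promise((_, reject) =>
    setTimeout(() => reject(new Error('request timed out')), ms)
  )
  return Promise.race([promise, timeout])
}

//inside an async function: give up on request i after 5 seconds
const response = await withTimeout(waitForResponse(i), 5000)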

I’m a fan of using a closure for indexing the requests! Great suggestion.

We’re way off the topic of connection loss, but anyway, here’s my crack at a thread-safe version in Julia:

mutable struct Index
    i::Int64
    lk::ReentrantLock
    Index() = new(0, ReentrantLock())
end

# calling the object increments the counter under the lock
# and returns the new value
(idx::Index)() = lock(idx.lk) do
    idx.i += 1
end

EDIT: I guess my pattern is a “callable struct”, but similar idea. The counter value is isolated to a callable object that is run by anything needing a new integer. Pretty neat.

To create an indexing object called “data_i”, to be used globally: data_i = Index()

To return a new integer, in whatever task needs it: data_i()

Linking to a related topic with more advice:

Hi, do you know how to close a socket session manually? I need to test in replay mode frequently, but sometimes when I run my program the socket times out. I think there is still an open socket on the TR server, so it rejects my frequent reconnection attempts.