WebSocket disconnect loops: fix real-time features after launch
WebSocket disconnect loops can ruin real-time apps after launch. Learn how to debug auth, timeouts, scaling, and add safe fallbacks.

Why real-time features break after launch
“Real-time” usually means the screen updates without a refresh. Chat messages appear instantly, presence shows who’s online, dashboards tick, and alerts pop up the moment something changes.
These features can look perfect in a demo and then fall apart after launch because production behaves differently. More users means more simultaneous connections. A proxy or load balancer sits between the browser and your server. Tokens expire. Phones switch networks, go to sleep, and wake up. Any one of those can turn a stable connection into a loop of disconnects and reconnects.
To users, a WebSocket disconnect loop feels like “it works for a second, then breaks.” Messages arrive late or not at all. The UI flashes “reconnecting…” repeatedly. Presence flickers. Dashboards freeze and then jump. Sometimes users get duplicates because the client resends after reconnect.
What changes after launch
A few predictable shifts push sockets over the edge: more concurrent connections than you tested locally, proxies that time out “idle” connections, WebSocket auth that fails during refresh or token rotation, mobile network drops that trigger aggressive reconnects, and multiple server instances without shared state (or without sticky sessions when you rely on in-memory state).
Socket reliability isn’t “never disconnect.” Disconnects will happen. Reliability means the app recovers quickly and safely, and it doesn’t lose important events. Missing a typing indicator is fine. Missing “message sent,” order status, or payment state is not.
Spot the pattern before you change anything
When real-time breaks after launch, the fastest way to waste time is to “fix” code blind. Start by describing the loop so clearly you can predict the next disconnect.
Name the symptom you actually see. “It’s flaky” hides clues. “Reconnects every 6 seconds and duplicates notifications” is something you can trace.
Before touching settings, answer a few questions:
- Which clients are affected (web, iOS, Android, one browser)?
- Which environment (local, staging, production only)?
- Does it hit everyone or specific accounts?
- Does it cluster during busy hours or right after deploys?
Then capture evidence while it’s happening. A short snapshot beats hours of guessing. At minimum, collect timestamps (client and server), a user or session ID, a connection ID you generate per socket, close code and reason (if available), and the last few events before the drop (connect, auth, subscribe, ping/pong).
To separate server-side disconnects from client-side drops, compare timelines. If the client shows code 1006 (abnormal closure) and the server doesn’t log a clean close, suspect network, proxy timeouts, or the app going to sleep. If the server closes right after “auth” or “subscribe,” suspect your own logic (bad token, missing permissions, thrown exceptions).
One practical trick: reproduce with one user in one tab first. If you can’t, the trigger may be load-related.
Step-by-step: debug a disconnect loop
When you see a disconnect loop, resist the urge to tweak timeouts first. Make the loop visible: what happened right before the socket dropped, and who is scheduling the reconnect.
Start with plain logs around the socket lifecycle. You want a clean story from start to end: connect, auth, subscription, message flow, then the close reason. Include timestamps and a short connection ID so multiple tabs don’t blur together.
Log the basics in order: connect started/open, auth sent and success/failure, subscribe sent and ack, close/error (code and reason), and reconnect scheduled (and by what).
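The ordered lifecycle log above can be sketched as a tiny helper. The `SocketLog` class and the event names are illustrative, not from any specific library; the point is the per-connection ID and timestamps so multiple tabs don't blur together.

```javascript
// Minimal sketch of per-connection lifecycle logging.
class SocketLog {
  constructor() {
    // Short random connection ID so multiple tabs are distinguishable.
    this.connId = Math.random().toString(36).slice(2, 8);
    this.entries = [];
  }
  log(event, detail = "") {
    const line = `${new Date().toISOString()} [${this.connId}] ${event} ${detail}`.trim();
    this.entries.push(line);
    console.log(line);
  }
}

// Usage: call log() at each lifecycle step so the story reads in order.
// The URL and channel name are hypothetical.
const log = new SocketLog();
log.log("connect:start", "wss://example.test/ws");
log.log("open");
log.log("auth:sent");
log.log("auth:ok");
log.log("subscribe:sent", "channel=orders");
log.log("close", "code=1006 reason=<none>");
log.log("reconnect:scheduled", "by=app backoff=2000ms");
```

Even this much is usually enough to see whether the close happens before or after auth, and who schedules the reconnect.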
Then reproduce with the smallest setup: one user, one tab, no background jobs. Once it fails reliably, add complexity one step at a time (second tab, second user, higher message volume). That tells you whether the trigger is load, concurrency, or a specific subscription.
Next, inspect close codes and errors. Policy-type closes often point to auth or origin rules. Timeouts usually point to heartbeats, proxies, or the server being blocked. Abnormal closes often mean something crashed or the network vanished without a proper close.
Also check whether you have two reconnect mechanisms at once: your code plus the socket library’s default. That can create reconnect storms even when the underlying issue is small.
Finally, test from a different network (mobile hotspot vs office Wi‑Fi). If it only happens on one network, focus on proxies, VPNs, captive portals, or aggressive idle timeouts.
Auth on sockets: where it usually goes wrong
Many “networking” bugs are actually auth bugs. The app loads fine, API calls work, then the live feature gets stuck reconnecting.
Three common auth setups
Most apps authenticate sockets in one of three ways: reuse a cookie session, send a bearer token during the connection, or fetch a short-lived one-time “socket token” over HTTPS first. All can work, but each has a common failure mode.
A classic mismatch: the normal HTTP requests are authenticated, but the WebSocket handshake isn’t. Cookies may be sent to your API but blocked for the socket because of cross-origin rules. Or the server expects an Authorization header, but the client library can only pass a token via query param or subprotocol.
Common post-launch failures:
- Cookies aren’t included because of SameSite, Secure, or domain settings (works on localhost, breaks on the real domain).
- The socket connects before the session is ready.
- The socket holds a stale access token after refresh and gets kicked repeatedly.
- The server closes as “unauthorized” and the client instantly reconnects, creating noisy spam.
Handle unauthorized closes without endless reconnects
Treat auth failures differently from flaky network failures. If the server closes with an auth-related code or message, stop the reconnect loop and recover the session first. Refresh the token (or prompt login), then open a fresh socket with the new credentials.
If you must retry, use backoff (1s, 2s, 5s, 10s) and add jitter so many clients don’t reconnect at the same moment.
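A jittered backoff schedule like the one above can be computed in a few lines. The base delay and cap are illustrative defaults, not requirements:

```javascript
// Jittered exponential backoff: exponential growth with a cap, then a
// random delay within that window ("full jitter") so many clients don't
// all reconnect at the same moment.
function backoffDelay(attempt, baseMs = 1000, capMs = 30000) {
  const exp = Math.min(capMs, baseMs * 2 ** attempt); // 1s, 2s, 4s, 8s... capped
  return Math.floor(Math.random() * exp);
}
```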
A common scenario: a dashboard runs for an hour, the token refreshes, but the socket keeps sending the old token and gets closed every few seconds. The fix isn’t “more retries.” The fix is restarting the socket when the token changes.
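Restarting on token change can be sketched like this. `getToken` and `openSocket` are hypothetical hooks standing in for your auth store and connection code:

```javascript
// Sketch: restart the socket when the token changes, instead of retrying
// with a stale one. getToken and openSocket are hypothetical hooks.
function watchToken(getToken, openSocket) {
  let current = getToken();
  let socket = openSocket(current);
  return function onTokenChange() {
    const fresh = getToken();
    if (fresh !== current) {
      socket.close(1000, "token rotated"); // clean close, no reconnect storm
      current = fresh;
      socket = openSocket(fresh);
    }
    return current;
  };
}
```

Call the returned function whenever your auth layer reports a refresh; unchanged tokens are a no-op, so it's safe to call often.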
Heartbeats, timeouts, and reconnect behavior
A lot of disconnect loops come down to this: the connection sits idle and something in the middle kills it. That “something” might be your server, a proxy or load balancer, a CDN, hotel Wi‑Fi, or a phone that suspends background activity.
The usual fix is a heartbeat plus sane reconnect behavior. Heartbeats can be ping/pong (best when your library supports it) or a tiny app-level keepalive message. Either way, you want enough traffic that intermediaries don’t mark the socket as idle.
Be conservative with timing. Many proxies drop idle connections around 30 to 60 seconds. A common starting point is a heartbeat every 15 to 25 seconds, and a client timeout after 2 to 3 missed heartbeats. Too aggressive a heartbeat wastes battery and data on mobile; too slow, and half-open connections sit undetected until a user notices.
Reconnect logic is the other half. Instant reconnects can create a storm, especially after a deploy or a brief outage. Use jittered exponential backoff with a cap, reset backoff only after the connection stays healthy for a short window, and make reconnect idempotent: re-auth and resubscribe, but don’t duplicate subscriptions.
Half-open connections are the sneaky case: the client thinks it’s connected, but the server is gone. Heartbeat timeouts let you detect that quickly.
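A heartbeat watchdog for exactly that case can be a tiny counter. This sketch is driven manually; in a real client you would call `tick()` from a `setInterval` matched to your ping interval (the 2-to-3-missed threshold comes from the timing above):

```javascript
// Declare the socket dead after N consecutive missed pongs.
class Heartbeat {
  constructor(maxMissed = 3) {
    this.maxMissed = maxMissed;
    this.missed = 0;
  }
  pong() { this.missed = 0; }               // any reply proves liveness
  tick() {                                  // called once per ping interval
    this.missed += 1;
    return this.missed >= this.maxMissed;   // true => treat as half-open, reconnect
  }
}
```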
Proxies and load balancers: hidden disconnect causes
If real-time worked locally but flaps in production, check the network path before rewriting socket code. Reverse proxies, CDNs, and load balancers can close idle connections, rotate instances, or drop headers.
What proxies change about WebSockets
A WebSocket starts as an HTTP request and then upgrades. Everything in front of your app must support that upgrade and keep the connection open. Many setups also enforce idle timeouts or a max connection age. If your app only sends data when the user clicks, the connection can look idle and get cut.
Sticky sessions are another trap. If you keep important state in memory (subscriptions, room membership, user context), and the load balancer sends a reconnect to a different instance, the user “connects” but misses events or fails state checks. Shared state (Redis, database, message broker) reduces the need for stickiness.
TLS termination and auth header surprises
When TLS is terminated at the proxy, your app may see the request as HTTP unless forwarded headers are set correctly. That can break checks like “only allow secure cookies” or strict origin rules. Some proxies also strip or rename headers, which breaks token-based authentication.
To confirm what’s closing the connection, compare close codes on both sides, look for proxy log messages like “upstream timeout” or “idle timeout,” temporarily increase idle timeouts to see if the problem stops, and verify upgrade and forwarded headers are reaching the app.
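For reference, here is what an upgrade-friendly setup looks like with Nginx as the reverse proxy (assuming Nginx; the location path, upstream name, and timeout value are illustrative):

```nginx
location /ws {
    proxy_pass http://app;
    proxy_http_version 1.1;                      # required for the upgrade
    proxy_set_header Upgrade $http_upgrade;      # pass the upgrade request through
    proxy_set_header Connection "upgrade";
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-Proto $scheme;  # so the app knows TLS was terminated
    proxy_read_timeout 75s;                      # idle timeout; default is 60s
}
```

Note that `proxy_read_timeout` defaults to 60 seconds, which is exactly the kind of silent idle cutoff described above; your heartbeat interval must stay comfortably under it.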
Scaling real-time features without losing events
Real-time often works in staging because there’s one server. After launch, a second instance appears (or your platform starts moving traffic), and messages go missing. Broadcasts only reach users on the same machine. Rooms and presence become inconsistent. Reconnects can land on a server that doesn’t recognize the client’s state.
The first rule: don’t keep important socket state only in memory. That includes subscription state, user-to-socket mappings, presence, and the last-seen event ID. In-memory state disappears on deploy and differs per server.
Most apps end up using one of these patterns: shared pub/sub so any server can publish and all servers can deliver, a dedicated real-time service that owns connections while API servers stay stateless, or a queue for events that must not be lost so they can be retried safely.
Reconnects are where duplicates sneak in. Use an event ID (or sequence number) per channel and have the client send “last received.” On the server, keep handlers idempotent so processing the same event twice doesn’t double-charge, double-create, or double-send.
Deploys need a plan, too. If you restart servers without warning, you force mass reconnects and race conditions. Add a drain step: stop accepting new connections on the old instance, let existing ones finish, then terminate.
Fallbacks that keep the app usable
Real-time is great until it isn’t. When users hit disconnect loops, you want two things: stable sockets and an app that still works when sockets aren’t stable.
WebSockets shine for two-way interaction (chat, multiplayer, live cursors). If the client mostly receives updates (status changes, notifications, dashboards), Server-Sent Events (SSE) can be simpler and more reliable because it uses a standard HTTP connection and usually behaves better through proxies.
A practical fallback is controlled degradation: try WebSockets, switch to SSE if the socket fails to open or drops too often, fall back to short polling if needed, and if reconnect keeps failing, put the app in a limited mode (read-only or “send later”) while retrying quietly.
Keep the UI honest. Show connection state (Connected, Reconnecting, Offline) and the last update time. A “Retry now” button helps when someone just changed networks.
On the server, design streams so a client can resume after reconnect. Send events with IDs or timestamps and allow “everything since X.” For SSE, you can use Last-Event-ID. For WebSockets, use a resume cursor or token on connect.
Common mistakes that create fragile sockets
Not every disconnect is a bug. Mobile networks drop. Laptops sleep. Browsers pause background tabs. The fragile part is when your app treats normal disconnects like emergencies and retries so aggressively it creates a self-inflicted outage.
One avoidable security mistake is putting secrets in the URL. Query strings get copied into logs, analytics, error reports, and screenshots. If your socket token is in the URL, assume it will leak. It’s also easy to log tokens by accident when dumping handshake data while debugging.
Local development can mislead you. On localhost there’s no corporate proxy, no load balancer, and no idle timeout policy. In production, a proxy might close idle connections, strip headers, or block upgrade requests.
A few patterns that usually make sockets brittle:
- Retry loops with no backoff or jitter.
- Auth in query strings, or logging that captures tokens.
- No server-side limits on reconnect attempts per user/IP.
- Reconnect logic that resubscribes blindly and creates duplicate listeners.
- Skipping tests behind a proxy or load balancer.
Duplicate subscriptions are especially sneaky. After reconnect, the client may rejoin the same room or register the same handler again while the server never cleans up the old one. Fix this by making subscriptions idempotent per connection and tracking connection IDs so a new socket can replace the old one cleanly.
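Making subscriptions idempotent can be as simple as keying handlers by channel, so re-subscribing replaces rather than stacks. The class name is illustrative:

```javascript
// Idempotent subscriptions: re-subscribing to the same channel replaces
// the old handler instead of adding a duplicate listener.
class Subscriptions {
  constructor() { this.handlers = new Map(); } // channel -> single handler
  subscribe(channel, handler) {
    this.handlers.set(channel, handler);       // Map.set replaces, never duplicates
  }
  dispatch(channel, event) {
    const h = this.handlers.get(channel);
    if (h) h(event);
  }
}
```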
Quick checks before you ship the fix
Before rolling out a WebSocket change, do a quick pass across the client, the server, and the infrastructure. Most disconnect loops aren’t one bug. They’re two or three small issues that only show up together.
Client-side checks
Your client should stay calm under failure. Use reconnect backoff with jitter and a cap, show connection state in the UI, add resume logic (last event ID or version) so brief drops don’t lose data, dedupe events so reconnects don’t double-apply updates, and close sockets cleanly on logout or account switch.
Server and infrastructure checks
Make disconnects understandable. If the server closes a connection, it should be for a clear reason, and one log line should explain it. Use clear close codes, enforce auth on connect and on sensitive messages, configure heartbeats and timeouts so healthy clients aren’t kicked, set connection limits per user/IP, and confirm proxy/load balancer WebSocket settings (including idle timeouts and whether stickiness is required).
A quick rule: if you can’t explain a disconnect from one log line, you’re not ready to ship.
Fast tests that catch regressions
Run a few scenarios that often reproduce loops: one user with many tabs while logging in and out, many users connecting at once (even a small load test), deploy while users are connected and watch reconnect behavior, and simulate flaky networks (toggle Wi‑Fi/cellular, sleep a laptop) to confirm the app recovers.
What “done” looks like: connections stay up for predictable periods, reconnects slow down instead of speeding up, and when a drop happens you can point to one clear cause.
A realistic example and next steps
A founder launches a live sales dashboard. In staging it looks perfect. On launch day, support tickets roll in: the page flashes “Reconnecting…” every few seconds, and some users never get live updates.
The first clue is in server logs: many connections end right after an access token expires. On the client, the app reconnects quickly but keeps reusing the same expired token, so it gets kicked again. The fix is making the socket handshake use a fresh token (or a short-lived socket token) and forcing a refresh before reconnect.
Then a second pattern appears: users who leave the page open get disconnected at almost exactly 60 seconds. That points to an infrastructure timeout. The load balancer drops idle connections, and the app isn’t sending heartbeats. A ping every 25 seconds plus a sane idle timeout stops the flapping.
Document what you changed so it stays fixed: expected close codes, token rules for sockets (where it comes from, when it refreshes, what happens on 401), heartbeat settings (ping interval, pong timeout) plus any proxy idle timeout, and reconnect rules (backoff timing, max attempts, when to stop and show a “Refresh” button).
Sometimes patching is slower than a refactor. If your socket handler mixes connection/auth, business logic, database writes, and permission checks in one place, small changes create new breakage. Split responsibilities: one layer for connection and auth, one for events, one for data.
If you’re dealing with an AI-generated prototype that worked in a demo but keeps breaking in production, FixMyMess (fixmymess.ai) can help diagnose the socket flow, repair auth and reconnect logic, and harden the app for real traffic after a free code audit.
FAQ
Why did my WebSocket real-time feature work in a demo but break after launch?
Because production adds real-world stress and “stuff in the middle.” More concurrent connections, proxies with idle timeouts, token refresh cycles, mobile sleep/network switches, and multiple server instances can all turn a stable demo into repeated disconnects and reconnects.
What’s the minimum info I should log to debug a disconnect loop?
Capture a tight timeline around one connection: client and server timestamps, a user/session identifier, a per-connection ID you generate, the close code and reason, and the last events before the drop (open, auth, subscribe, ping/pong). With that, you can usually tell whether the client disappeared, the proxy cut it, or your server logic closed it.
What does WebSocket close code 1006 usually mean?
A 1006 is an abnormal closure, meaning the browser didn’t get a clean close frame. That often points to network drops, the app going to sleep, proxy/load balancer timeouts, or a server crash that ended the TCP connection without a proper WebSocket close.
How should I handle “unauthorized” socket closes without endless reconnecting?
Don’t treat it like a flaky network. Stop the reconnect loop, refresh the session or token first, then open a fresh socket using the new credentials. If you keep reconnecting with the same expired token, you’ll create a tight loop that looks like a network bug but is really auth.
What heartbeat (ping/pong) timing should I use to prevent idle timeouts?
Start with a heartbeat every 15–25 seconds and consider the connection “dead” after 2–3 missed responses, then reconnect with backoff. The goal is to keep intermediaries from marking the connection idle, while not draining battery or data on mobile.
How do proxies or load balancers cause random WebSocket disconnects?
Anything that sits between the browser and your app can affect WebSockets: reverse proxies, CDNs, and load balancers must support HTTP upgrade, preserve needed headers, and allow long-lived connections. A common failure mode is an idle timeout around 30–60 seconds that kills “quiet” sockets unless you send heartbeats.
Do I need sticky sessions for WebSockets in production?
If you keep important state in memory (rooms, presence, subscriptions), a reconnect routed to a different instance can “connect” but not restore state, causing missing messages or failed checks. Either make servers stateless by moving state to shared storage/pub-sub, or ensure your infrastructure routes consistently when you truly need stickiness.
How do I stop duplicate messages after reconnects?
Use event IDs or sequence numbers and make both sides tolerant of retries. On reconnect, the client should resume from “last received,” and the server should process events idempotently so replaying the same message doesn’t double-charge, double-create records, or double-send notifications.
When should I use SSE or polling instead of WebSockets?
If the client mostly receives updates and doesn’t need two-way interactions, SSE is often simpler and behaves better through many proxies because it’s standard HTTP. For worst-case reliability, have a controlled fallback path so the app still works when sockets flap, even if updates arrive a bit slower.
When should I bring in FixMyMess to fix my real-time feature?
If your app is an AI-generated prototype and you’re stuck in a reconnect loop you can’t explain from logs, it’s usually faster to get a structured diagnosis than to keep tweaking timeouts. FixMyMess can audit the codebase, pinpoint whether the cause is auth, infrastructure, or scaling, and then repair the real-time flow so it holds up under real traffic.