Root Cause Analysis: WebSocket Errors During a Rainbow Deployment
Last month I got pulled into an incident investigation triggered by a spike in WebSocket error logs from a session management service in production. The errors clustered in a 15-minute window and initially looked like three separate problems. They weren't. They all traced back to the same root cause: a deployment model change that introduced a distributed state consistency window nobody had designed for.
How I Investigated
First stop was Cloud Logging. The service runs in a GKE cluster on GCP, so I pulled Kubernetes container logs for the affected service in the zpc namespace:
gcloud logging read \
  'resource.type="k8s_container"
   AND resource.labels.namespace_name="zpc"
   AND resource.labels.container_name="<service>"
   AND severity>=WARNING' \
  --project=spg-zpc-p \
  --freshness=2h \
  --limit=500 \
  --format=json
Successive queries drilled down by error message, then by specific session IDs to trace the full error sequence for a single connection lifecycle. This is the investigation pattern I use for production incidents: start wide, narrow by error class, then pick one example and trace it end to end.
Three distinct error patterns came out of that analysis.
Error Pattern 1: NullPointerException on WebSocket Connect
Eleven ERROR log entries. The sequence: device connects without a valid session cookie → repeated warnings about failing to get a unique ID from the session → terminal NullPointerException closing the session.
The code was attempting to look up a session identifier from the cookie before checking whether the cookie value was non-empty. Empty cookie → null lookup → null dereference. The fix is a null guard before any session lookup is attempted. This one was independent of everything else — a pre-existing bug that happened to appear in the same window.
Error Pattern 2: Redis State Mismatch — "Wrong Pod"
Session state was stored in Redis and shared across a cluster of pods. During the incident window, errors showed up like:
Session for <session-id> is stored on pod-A but current pod is pod-B
A client reconnected after a brief disconnect and landed on a different pod than the one that originally created its session. Under the previous deployment model, session affinity rules would have mitigated this. Something had changed.
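The log message suggests an ownership check along these lines. This is a reconstruction from the error text, not the service's actual code, and the method and parameter names are hypothetical:

```java
// Reconstruction of the check behind the "wrong pod" error; names are
// hypothetical. The session record in Redis carries the pod that created it,
// and the pod handling a reconnect compares that against its own identity.
static String checkSessionOwner(String sessionId, String ownerPod, String currentPod) {
    if (!ownerPod.equals(currentPod)) {
        return "Session for " + sessionId + " is stored on " + ownerPod
             + " but current pod is " + currentPod;
    }
    return null; // session owned by this pod; proceed normally
}
```

Under blue/green, session affinity made the `ownerPod == currentPod` branch the overwhelmingly common case; the transition window broke that assumption rather than the check itself.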
Error Pattern 3: Redis State Mismatch — Null Value on CAS
A second Redis variant: session keys existed but their values had gone null. The service was doing a compare-and-swap delete — read the key, check the value, then conditionally delete. But between the read and the delete, another operation had already nulled out the value. Classic check-then-act race.
The Root Cause: Rainbow Deployment
A check of the GitHub history showed that PR #344 had been merged on March 11. It implemented a Rainbow deployment model — replacing the classic blue/green two-color approach with an N-color model:
- Deploy a new "color" alongside existing colors
- Run smoke tests against the new color in isolation
- Patch the Kubernetes service selector to the new color
- Drain and remove old colors
The transition window between the service selector patch and the full drain of old-color pods was the source of both Redis errors:
Pod mismatch: During the selector transition, new connections land on new-color pods while in-flight sessions are still stored on old-color pods. The session state in Redis is correct — the routing assumption is broken.
CAS race: Old-color pods draining are deleting session keys from Redis. New-color pods are reading those same keys. The window between a successful read and the CAS delete is wide enough for an old pod to null out the value.
These two errors are a direct consequence of the deployment model change. Blue/green doesn't have this problem because state belongs to exactly one color at a time. Rainbow deployments create a transition window where multiple colors are simultaneously active. Any service storing per-connection state in a shared backend needs explicit design for that window.
Fixes
Null guard on the empty cookie:
String cookieValue = getCookieValue(request, "session_id");
if (cookieValue == null || cookieValue.isEmpty()) {
    closeSession(session, "Invalid or missing session cookie");
    return;
}
Atomic CAS delete using Redis WATCH/MULTI/EXEC or a Lua script to eliminate the check-then-act race. At minimum, verify the value hasn't changed between the read and the delete — though note this only narrows the window, it doesn't close it:
String currentValue = redis.get(sessionKey);
if (currentValue != null && currentValue.equals(expectedValue)) {
    redis.del(sessionKey);
}
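The race only truly closes when the compare and the delete execute atomically inside Redis. A minimal sketch of the semantics, assuming the Lua-script approach — the script text and the `SessionStore` stand-in are illustrative, not the service's actual code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class SessionStore {
    // A Lua script Redis executes atomically, so no other command can
    // null or change the value between the GET and the DEL.
    static final String CAS_DELETE_LUA =
        "if redis.call('GET', KEYS[1]) == ARGV[1] then "
      + "return redis.call('DEL', KEYS[1]) else return 0 end";

    // In-memory stand-in to demonstrate the semantics: Map.remove(key, value)
    // is the same compare-and-delete shape, performed atomically.
    private final Map<String, String> store = new ConcurrentHashMap<>();

    void put(String key, String value) {
        store.put(key, value);
    }

    // Deletes the key only if its value still equals expectedValue.
    boolean casDelete(String key, String expectedValue) {
        return store.remove(key, expectedValue);
    }
}
```

With a client such as Jedis, the script would be submitted via EVAL with the session key and the expected value as arguments; WATCH/MULTI/EXEC gives the same guarantee at the cost of a retry loop when the key changes mid-transaction.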
Rainbow deployment session affinity: Hold old-color pods in a draining state where they continue serving existing sessions but accept no new connections. Or use consistent hashing for Redis key routing so session ownership is deterministic by session ID rather than pod assignment.
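The deterministic-ownership option can be sketched as deriving the owner purely from the session ID, so every pod — old color or new — computes the same answer during the transition window. This is a simplified hash-slot scheme in the spirit of consistent hashing; the slot count and CRC32 are illustrative choices, not the service's actual scheme:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

final class SessionOwnership {
    // Maps a session ID onto one of slotCount ownership slots, each assigned
    // to a pod. Ownership depends only on the ID, so old-color and new-color
    // pods agree on the owner regardless of which pod accepted the connection.
    static int ownerSlot(String sessionId, int slotCount) {
        CRC32 crc = new CRC32();
        crc.update(sessionId.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % slotCount);
    }
}
```

A full consistent-hashing ring would additionally keep most slot assignments stable when pods are added or removed; the sketch above shows only the deterministic-by-ID property that the transition window requires.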
What I Sent to the Team
After completing the analysis, I sent the responsible engineer a direct message on Teams with the PR number, specific log examples for each error class, and the recommended fixes. Not a Jira ticket. Not a Slack thread. A direct message with full context, because that's what gets read.
The Broader Lesson
Rainbow deployments are genuinely better than blue/green for incremental rollouts and per-color smoke testing. But they shift the failure mode. The distributed state consistency problem they introduce won't be handled correctly by default — you have to design for it explicitly. If your service stores per-connection state in a shared backend, the transition window between selector patches is a window where correctness guarantees you had in blue/green no longer hold.
