The Pause That Matters: Evidence-Based LLM Incident Investigation
Today I investigated a production incident with an LLM as my primary investigation partner. The LLM made eight verifiable errors. All eight were caught. This is a post about that — about what evidence-based LLM incident investigation actually looks like when you practice it honestly, and about the moment partway through when I realized I needed to change how I was working.
The incident: ~560K concurrent WebSocket connections from a device fleet to the device-ws service. At 13:56 UTC, ~292K of them — 49% — disconnected simultaneously across all tenants. Recovery took 21 minutes. Root cause confirmed via CI build logs and git history. Three infrastructure changes collided in a 75-minute window.
What the LLM got right
Before the errors: the LLM earned its place in this investigation.
The cascade mechanism. I described the system topology and the failure shape, and the LLM modeled the amplification correctly without being told the architecture explicitly: gateway pod termination → reconnect storm → Redis ownership contention → health probe failure → HAProxy reload → connection reset → repeat. That's not an obvious chain. It got there.
The HAProxy cookie inference was more impressive. I showed it the production legacy gateway config — cookie SERVERID insert indirect nocache dynamic — and it correctly reasoned that the firmware persists cookies across reconnects. From there it concluded that the cold-start problem wasn't "firmware ignores cookies" but "no cookie exists yet to persist." That required reasoning from an existing system's config to infer firmware behavior it had never seen. It was right.
It also got the stateless vs. stateful routing distinction without prompting: consistent hash on serial number (any pod can route correctly given only the request) vs. cookie-based (pod needs session state it doesn't have on cold start). And it correctly predicted the Envoy CONNECT proxy dead end — an HTTP forward proxy won't see headers inside a TLS tunnel established via HTTP CONNECT — before we confirmed it.
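The stateless half of that distinction can be sketched as a small consistent-hash ring keyed on serial number. This is an illustrative sketch, not the routing code from the incident: the pod names, the virtual-node count, and the choice of SHA-256 are all assumptions made for the example.

```python
import bisect
import hashlib
from typing import Iterable


def _hash(key: str) -> int:
    # Stable 64-bit hash; Python's built-in hash() is salted per process.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring: a serial number maps to a pod with no session state."""

    def __init__(self, pods: Iterable[str], vnodes: int = 100) -> None:
        # Each pod gets `vnodes` points on the ring (hypothetical default).
        self._ring = sorted(
            (_hash(f"{pod}#{i}"), pod) for pod in pods for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def pod_for(self, serial: str) -> str:
        # First ring point clockwise from the serial's hash, wrapping at the end.
        idx = bisect.bisect(self._points, _hash(serial)) % len(self._ring)
        return self._ring[idx][1]


# Hypothetical pod names, for illustration only.
ring_a = HashRing(["gw-0", "gw-1", "gw-2"])
ring_b = HashRing(["gw-2", "gw-1", "gw-0"])  # built independently, different order
print(ring_a.pod_for("SN000123") == ring_b.pod_for("SN000123"))
```

Two routers that share only the pod list agree on the target, which is why a cold start is harmless here. A cookie-based scheme has nothing to read until the first response has set the cookie.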
These were genuine architectural reasoning tasks. The LLM handled them.
The trigger
Early in the investigation I asked for per-tenant connection metrics. The LLM said it could only provide an aggregate — all tenants combined.
I already knew that was wrong. Prior manual investigation with the team had surfaced per-tenant metrics. The gap between what the LLM believed was possible and what I knew was real made something click. "Trust, but verify" stopped being a phrase and became the operating principle for everything that followed.
The important thing: catching that error required prior domain knowledge. The skeptic reflex only fires if you have something to be skeptical with. An engineer without the team context would have accepted the all-or-nothing answer and moved on. Domain knowledge isn't a nice-to-have for LLM-assisted investigation — it's the mechanism by which you catch fabrications. No foundation, no catches.
The people-pleaser problem
About 30 minutes in I stepped away. When I came back I saw it clearly: I had been feeding the LLM hypotheses, and the LLM — being a people-pleaser — had been confirming them.
I realized the LLM is as much a people-pleaser as I am, and that I need to keep my skeptic hat on. Two people-pleasers in a room don't get to the truth faster; they agree themselves around it.
The shift I made: stop providing the conclusion and asking for validation. Go to primary sources, bring the real data back, use the LLM to interpret against live system data. The LLM becomes an interpretation layer on top of facts it didn't generate, not a hypothesis engine operating on nothing.
The prompting discipline
Five patterns that held for the rest of the investigation:
"Confirm that..." — imperative framing, not a request. Forces the LLM to treat the claim as something to be verified before it lands, not something to elaborate on.
Conditional guardrails — "only if you can determine that from metrics." Explicit: if you'd have to guess, don't. This surfaces uncertainty the LLM would otherwise smooth over.
Primary source anchoring — "we make the hardware and the firmware." Signals that the LLM's inference here is worth nothing; I have actual knowledge. It changes how the LLM scopes its confidence.
Multi-layer skepticism — checking Terraform, Helm, and Kubernetes separately. Each layer can mask or override the previous one. The top-level view is not trustworthy on its own.
Tool-directed verification — "use the CLI tool," "go look at the live system." Forces current-state checks rather than reasoning from training data or the session context I'd provided.
None of these are novel. They're just discipline — applied consistently after the walk-away made the pattern visible.
The prerequisite
The LLM could navigate to the right CI job, find the right build, parse the right 4000-line console log because I'd spent months building its working memory of the entire infrastructure — CI/CD topology, cluster names, job hierarchies, interdependent systems, which team owns what. That holistic environment model wasn't built in this session.
The numbers, verified from git history and session logs: 30 architecture documents (259KB of infrastructure knowledge) built over a month — GKE workloads, Jenkins pipelines, Terraform layouts, gateway configuration, monitoring, sandbox lifecycle. Then 16 working sessions across four weeks of active engineering in the codebase: 1,527 turns, 23 hours of wall-clock time. The incident happened in week five.
Without that foundation, "find the right CI build for commit 3d734f03" is a dead end. With it, it's a 90-second lookup. The documentation phase wasn't optional scaffolding — it was the price of admission.
The eight errors
Here's the honest accounting. Every error was caught.
| # | What the LLM said | Ground truth | How caught |
|---|---|---|---|
| 1 | INFRA-1042 deployed ~13:07 UTC (commit time) | Deployed 13:52:28 UTC — 45 min later when CI ran Terraform | CI build console log |
| 2 | Fix "improved throughput by ~26×" | No such calculation exists. Invented. | Second-pass review agent |
| 3 | "Revert confirmed INFRA-1042 as root cause" | Revert also re-disabled the gateway layer entirely — confounded the conclusion | Second-pass review agent |
| 4 | Commit 3d734f03 was dated "March 6" | March 4 | git log --format='%ai' |
| 5 | Timeline: RECOVERY → SIGKILL → STABLE | SIGKILL at 14:17:48 occurred during recovery, not after | Second-pass review agent |
| 6 | PR #417 tagged INFRA-1039 | INFRA-1039 belongs to PR #414. PR #417 has no ticket. | gh pr view 417 |
| 7 | "Lua infrastructure deployed 45 min before the app change" | Both deployed in same Terraform apply, 1 second apart | CI build console log |
| 8 | "This was my first time using an LLM for infrastructure work" | False. I do infrastructure work with LLM assistance daily. | I read the draft |
Errors 1 and 7 share a root cause: the LLM treated the git commit timestamp as the deployment timestamp. They're not the same thing. A commit merged at 13:07 doesn't reach production until CI picks it up, Terraform plans it, approval gates pass, and the apply runs. That gap was 45 minutes here. It's a category error, not a random hallucination, and it produced two wrong claims because the same wrong model was applied in two places.
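The gap is easy to make explicit once the two timestamps are kept as separate values. A minimal sketch using only the times quoted in this post; the seconds on the commit time and on the disconnect time are assumed to be zero, since only "~13:07" and "13:56" are given.

```python
from datetime import timedelta


def utc_clock(hours: int, minutes: int, seconds: int = 0) -> timedelta:
    """Time of day (UTC) as an offset from midnight; all events are on one day."""
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


commit_time = utc_clock(13, 7)        # git commit time, "~13:07" (seconds assumed)
deploy_time = utc_clock(13, 52, 28)   # Terraform apply, from the CI console log
disconnect = utc_clock(13, 56)        # mass disconnect (seconds assumed)

commit_to_deploy = deploy_time - commit_time
deploy_to_disconnect = disconnect - deploy_time

print(f"commit -> deploy:     {commit_to_deploy}")
print(f"deploy -> disconnect: {deploy_to_disconnect}")
```

Measured from the commit, the change looks like it sat in production for most of an hour before anything broke. Measured from the apply, it had been live for a few minutes.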
Error 2 has the most dangerous texture. "~26× throughput improvement" sounds like something someone calculated. It has the shape of a real finding. It's completely fabricated.
Error 8 is the most instructive. It happened while drafting this post — the LLM wrote that this investigation was my first time using an LLM for infrastructure work. It isn't. The claim was plausible, contextually coherent, and wrong. I caught it because I know my own history. The discipline doesn't stop at incident reports.
Why post-mortems are the risky case
In code review, a fabricated method name fails to compile. The error surfaces before it matters.
In incident investigation, fabricated specifics are plausible. "~26× improvement" sounds like something someone calculated. "March 6" sounds like a date someone looked up. "45 minutes before" sounds like a meaningful lead time. The failure mode doesn't announce itself.
Post-mortems are uniquely risky because their purpose is establishing ground truth. An LLM that generates a plausible but wrong timeline produces documentation that needs correction before it becomes the official record. Official records are what future engineers use to avoid the same incident. A wrong post-mortem doesn't just fail to help — it actively sets up the next failure.
The workflow
What replaced "ask and trust":
- LLM proposes a claim
- Identify the authoritative source
- Fetch it
- Compare
- Correct and continue
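The loop above can be sketched as a small claim ledger. This is an illustration under stated assumptions, not the tooling used in the investigation: the `Claim` and `Finding` types are hypothetical, and the fetchers are stubs that return the values reported in the error table rather than calling git, gh, or the CI system.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Claim:
    statement: str             # what is being asserted
    claimed: str               # the value the LLM proposed
    source: str                # the authoritative source to check against
    fetch: Callable[[], str]   # reads that source (stubbed in this sketch)


@dataclass
class Finding:
    claim: Claim
    actual: str

    @property
    def matches(self) -> bool:
        return self.claim.claimed == self.actual


def verify(claims: List[Claim]) -> List[Finding]:
    """Fetch the authoritative value for every claim and pair it with the claim."""
    return [Finding(claim, claim.fetch()) for claim in claims]


# Stub fetchers: in real use each would query the named source directly.
claims = [
    Claim("Date of commit 3d734f03", "March 6",
          "git log --format='%ai'", lambda: "March 4"),
    Claim("Ticket on PR #417", "INFRA-1039",
          "gh pr view 417", lambda: "none"),
    Claim("INFRA-1042 deploy time (UTC)", "13:52:28",
          "CI build console log", lambda: "13:52:28"),
]

corrections = [f for f in verify(claims) if not f.matches]
for f in corrections:
    print(f"{f.claim.statement}: claimed {f.claim.claimed!r}, "
          f"source says {f.actual!r} ({f.claim.source})")
```

Recording the source next to each claim is the point of the structure: a claim with no fetchable source cannot enter the ledger, so it cannot enter the report.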
The iteration tax is real. Getting into a production CI instance, navigating job hierarchies, parsing console output — that's not a 30-second check. Over the course of this investigation it was hours of verification loops.
But the result is a report where every timestamp is pinned to a log line, every causal claim is defensible, and nothing was fabricated. Seven errors caught in the incident investigation. One caught in the draft of this post. None in the final record.
That's the workflow. It's not clever. It's just the one that works.