Fanning Out a Multi-PR Rollout with Worktree-Isolated Subagents

I started the session expecting cleanup. Two prior PRs against an internal GCP sandbox cleanup pipeline at Business X — let's call the repo Product Y — had been closed unmerged. The descriptions claimed they'd landed key fixes. The team had moved on. I opened the repo to confirm and start the next round of work on top.

git merge-base --is-ancestor told a different story. None of the commits from PR #147 or PR #149 had reached origin/main. What had been merged was a partial state: compute#backendService was wired into the graph and the gcloud-script generator, but missing from the gather phase and missing from the allowlist that protects fresh resources from deletion. A reaper run in that state would walk the graph, see backend services it had no record of gathering, and emit deletion commands for resources the safety check could not protect.

That partial state is the story. The orchestration pattern I built around fixing it is the rest of the post.

The Audit That Reframed the Work

Before any planning, I read the two CLOSED-UNMERGED PR diffs end to end and cross-checked every commit SHA against origin/main:

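for sha in 0228929c 71ec222c aa564aa4 75c01a27; do
  printf '%s: ' "$sha"
  # Exit 0 = ancestor, 1 = not an ancestor, anything else = error (e.g. unknown SHA)
  git merge-base --is-ancestor "$sha" origin/main 2>/dev/null
  case $? in
    0) echo on-main ;;
    1) echo not-on-main ;;
    *) echo "error (is the SHA fetched?)" ;;
  esac
done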

Every line returned not-on-main. The PR descriptions had described an intended end state. The actual repo state matched neither the "before" nor the "after" cleanly — it was a third state, the half-wired one, that nobody had explicitly designed.

This is the failure mode that's hard to spot in passing review: a closed PR with thoughtful comments and a "we'll come back to this" feel. Months later, the codebase looks consistent at a glance. It isn't. The check that catches it is one shell loop and a willingness to distrust the PR description.

The Plan: Eight Follow-Up PRs, Ranked by Risk

I wrote a plan file enumerating eight follow-up PRs (PR-A through PR-H), each with a file:line edit map and a dependency graph. Some were independent. Some stacked. Three were intentionally deferred — gated on production validation runs, an OOM regression that hadn't yet recurred, or an investigation outcome.

Then I created a parent story in Jira plus eight subtasks, nine tickets in all (INFRA-5604 through INFRA-5612). Tickets are how this team tracks work; the Jira IDs would also become the branch names.

Subtask	Title	PR
INFRA-5605	PR-A: allowlist gaps + LB cleanup + expire `<=`	#152
INFRA-5606	PR-B: RDP zombie chain + cluster URL fix + resilient gather	#155 (stacked on PR-A)
INFRA-5607	PR-C: cleanup_tfstate.py replacement	#154
INFRA-5610	PR-F: deterministic gcloud script	#156
INFRA-5611	PR-G: empirical fresh-cluster TTL	#153
INFRA-5608 / INFRA-5609 / INFRA-5612	PR-D / PR-E / PR-H	deferred

PR-A First, In the Main Session

I did PR-A in the foreground. It was small enough to be done quickly, and it was the first one to establish the templates the rest of the work would copy: the commit-message format, the PR body structure, the dry-run verification block, the way Jira IDs got linked.

The point was not just to land PR-A. The point was to write the artifacts that the subagents could imitate without having to invent. Subagents are good at following templates and bad at inventing them — same as humans new to a codebase.

The Fan-Out: Four Subagents, Four Worktrees, In Parallel

PR-B, PR-C, PR-F, and PR-G were independent in their file edits. I spawned four subagents in a single message — each in its own git worktree on its own branch — and let them run in parallel.

The worktree isolation matters. Without it, four agents editing the same checkout would step on each other's staged changes, blow away each other's working-tree state, and produce a merged mess that no diff tool can untangle. With worktrees, each agent owns a directory with its own index and HEAD; the underlying object database is shared, but the working trees are separate.
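A minimal sketch of that setup, assuming one worktree per subtask under a directory of your choosing and branches named exactly after the Jira IDs (the real naming scheme may differ). The helper add_worktrees is illustrative, not something from the repo; the only real command it relies on is git worktree add -b.

```shell
# Illustrative helper (not from the repo): create one worktree per "<branch>:<base>" spec.
# Each worktree gets its own directory, index, and HEAD; the object database stays shared.
add_worktrees() {
  local root=$1; shift
  local spec branch base
  for spec in "$@"; do
    branch=${spec%%:*}   # text before the first colon
    base=${spec#*:}      # text after the first colon
    git worktree add -b "$branch" "$root/$branch" "$base" || return 1
  done
}

# Example with assumed branch names: PR-B stacks on PR-A's branch,
# the other three start from origin/main.
# add_worktrees ../wt INFRA-5606:INFRA-5605 INFRA-5607:origin/main \
#   INFRA-5610:origin/main INFRA-5611:origin/main
```

Passing the base per branch keeps the stacked PR explicit at creation time instead of leaving it to be discovered during review.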

The prompt template each agent received had four sections:

Context — what session this is, what PR-A established, the file:line edit map from the plan, the templates for commit messages and PR bodies.
Sibling PRs — what each of the other three agents is working on right now, and what file regions they're touching. This is the critical section. Without it, each agent acts as if it owns the whole repo.
Deferral protocol — explicit list of edits that look like they belong in this PR but actually belong to a sibling. Defer them with a comment in the PR body so the reviewer can verify the boundary.
Verification — run the repo-specific dry-run diff skill against the branch, post the comparison summary as a PR comment.
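The four sections above can be sketched as a small template function. This is an assumption about shape, not the prompt I actually used: render_brief, its argument order, and the header wording are all hypothetical. The one deliberate choice is that it refuses to render a brief with an empty section.

```shell
# Illustrative only: assemble a subagent brief from the four sections.
# Arguments are free-form text; the header wording is an assumption.
render_brief() {
  local context=${1:-} siblings=${2:-} deferrals=${3:-} verification=${4:-}
  # Refuse to emit a brief with a missing section.
  if [ -z "$context" ] || [ -z "$siblings" ] || [ -z "$deferrals" ] || [ -z "$verification" ]; then
    echo "render_brief: all four sections are required" >&2
    return 1
  fi
  cat <<EOF
## Context
$context

## Sibling PRs
$siblings

## Deferral protocol
$deferrals

## Verification
$verification
EOF
}
```

Failing loudly on an empty section is cheaper than finding out later that an agent was briefed without its sibling map.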

The deferral protocol earned its keep on PR-A. During its review iteration, both Copilot and a human reviewer suggested adding compute#targetTcpProxy to the allowlist. It was a reasonable suggestion in isolation. It was also explicitly PR-B's territory per the plan, and PR-B was already open with that exact entry in its diff. I deferred the suggestion back to PR-B with a one-line comment and a link to its diff.

The point of the deferral is not to be precious about scope. The point is that PR boundaries are how a reviewer keeps a multi-PR series legible. If every PR absorbs adjacent suggestions, the boundaries blur and the next reviewer can't tell what changed where.

Dry-Run Diff as the Merge Contract

For a cleanup pipeline, the most useful pre-merge artifact is not test results. It's a diff of what the pipeline would actually delete, before vs. after the change.

The repo has a skill that runs the dry-run pipeline twice — once on the PR branch, once on main — and diffs the resulting gcloud deletion scripts. The output is a structured comparison: commands added, commands removed, commands changed, plus context about which projects and resource types are affected.
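The repo's skill is not reproduced here. As a rough sketch of the core comparison, assuming each generated script has one deletion command per line and every such line starts with "gcloud " (both are assumptions), a hypothetical diff_deletion_scripts could look like this:

```shell
# Illustrative sketch: compare two generated deletion scripts line by line.
# Prints "+ <cmd>" for commands only in the branch script,
# and "- <cmd>" for commands only in the main script.
diff_deletion_scripts() {
  local main_script=$1 branch_script=$2 tmp
  tmp=$(mktemp -d) || return 1
  # comm needs sorted input; keep only gcloud lines and drop duplicates.
  { grep '^gcloud ' "$main_script" || true; } | sort -u > "$tmp/main"
  { grep '^gcloud ' "$branch_script" || true; } | sort -u > "$tmp/branch"
  comm -13 "$tmp/main" "$tmp/branch" | sed 's/^/+ /'   # only on the branch
  comm -23 "$tmp/main" "$tmp/branch" | sed 's/^/- /'   # only on main
  rm -rf "$tmp"
}
```

In this simplified form a changed command shows up as one removal plus one addition; the real skill reports changed commands as their own category. Empty output means the two runs would delete exactly the same things.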

I posted the dry-run comparison to PR-A as a comment after the initial implementation. Then, after the review-iteration amendment, I re-ran it. The two diffs were identical, which was the expected outcome — the review fixes were template/comment edits with no runtime behavior change. Posting both runs makes that explicit. A reviewer doesn't have to take it on faith.

The rule I wrote into memory after this session: every review iteration on a PR that touches the pipeline gets a fresh dry-run diff posted before merge. The diff itself is the merge contract.

The Second Wave

When the four subagents reported back, each PR had its own pile of review comments — some Copilot, some human. Rather than work through them serially, I spawned a second wave of four subagents in parallel, each one targeting one PR's review comments.

The second-wave prompts were lighter than the first. By this point, the templates existed, the sibling map was stable, and the agents had only to address specific comments on their assigned PR. The verification step was the same: re-run dry-run diff, post the comparison.

One agent finished cleanly while the other three were still working. The session ended with three review-fix runs in flight, captured in the handoff doc as known unknowns to verify in the next session. That's not a failure to wrap up — it's the honest state, written down.

What the Pattern Optimizes For

The orchestration pattern is not a productivity hack. It's an attempt to keep eight follow-up PRs internally consistent across a series, when each PR individually is small enough that the coordination overhead would otherwise dominate.

The expensive thing here is not the editing. It's the cross-PR reasoning: which entry belongs in which PR, which suggestion to defer, what dependency order the reviewer should merge in. That reasoning lives in the plan file and in the sibling-PR context that every subagent receives. The editing is mechanical once that's correct.

The cheap thing, and this is the whole reason for parallel subagents, is wall-clock time on the editing itself. Four worktrees, four agents, one message. The plan that took an hour to write drives what would otherwise be about four hours of sequential edit work, and the parallel run finishes it in twenty minutes.

Pointing the Pattern at the Code-Review Workflow Itself

The same repo had a Claude-bot code-review GitHub Action firing on every push. It worked, mostly. During this rollout I logged five distinct failure modes in the span of one afternoon:

Silent re-runs. gh run rerun returned success and the workflow showed green, but no review comment appeared on the PR. The reviewer had no way to tell whether the bot had actually run or just exited early on a cache hit.
Stale-diff false [BLOCKER] on stacked PRs. The bot evaluated the GitHub-provided diff base and missed the post-rebase file state, flagging code that had already been corrected on the parent branch.
A four-reviews-in-sixteen-minutes duplicate storm from a single push wave. No debounce.
A SAST false positive (cycode flagging a fixed-seed RNG inside a test file) that the bot didn't pre-empt or contextualize.
No machine-readable merge-ready signal. Every reviewer parsed prose summary bodies for the literal string "Result: APPROVED".

So I pointed the orchestration pattern back at the workflow that was supposed to be reviewing my orchestration's output.

A Plan-mode subagent enumerated nine discrete improvements: externalize the prompt to a file, fingerprint each comment with the head SHA, make the bot stacked-PR-aware, debounce concurrent runs, expand triggers, add deferral vocabulary to the system prompt, pre-empt SAST findings, emit a machine-readable merge-ready check, and surface a per-run cost summary. The plan agent ran read-only and couldn't write the file itself — output had to be relayed by the parent. Worth a workflow tweak next round.

A second subagent in a worktree implemented the five highest-impact reliability fixes as one PR. The remaining four deferred to follow-up tickets, by design — the plan's "independently mergeable" property meant deferral was free.
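To make the head-SHA fingerprint idea concrete, here is a minimal sketch. The marker format is my assumption for illustration (an HTML comment, which GitHub does not render in a comment body), and both function names are hypothetical rather than taken from the workflow.

```shell
# Hypothetical marker format: an HTML comment embedded in the review body.
review_fingerprint() {
  printf '<!-- review-bot head=%s -->' "$1"
}

# Reads existing comment bodies on stdin; succeeds if one of them
# carries the fingerprint for the given head SHA.
already_reviewed() {
  grep -qF -- "$(review_fingerprint "$1")"
}
```

The same check serves two of the failure modes: a run can skip posting when the current head SHA is already fingerprinted (the duplicate storm), and a post-run step can fail when no comment carries it (the silent re-run).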

What worked, recursively this time:

Fan-out parallelism held up. Four PRs implemented concurrently in worktrees, then four more subagents iterated on review comments concurrently against those PRs.
The full Context section in each subagent prompt did the heavy lifting. Parent ticket, sibling PRs, deferral protocol, verification command. Strip any one of those and subagents silently dropped scope that belonged to a sibling.
The dry-run diff skill ran on every iteration. It caught real semantic deltas (the targeted fix did exactly one thing, no incidental changes) and surfaced gather-stage flakiness as a pre-existing condition rather than a regression introduced by the PR.
Cross-checking PR-description merge claims with git merge-base --is-ancestor. It's the same five-minute check that uncovered the half-wired partial state at the start of the session. Both prior PR descriptions claimed the work had landed; the ancestor check proved it hadn't.

What didn't:

Two subagents force-pushed onto stacked branches mid-iteration. Coordinated rebase-after-merge would have been cleaner; the diffs didn't overlap, so it happened to work.
The bot's silent reruns continued mid-rollout. Some PRs went un-re-reviewed despite the workflow firing. That's what triggered the recursive workflow improvement in the first place.
The auto-mode permission classifier denied legitimate zcli jira create calls that I had explicitly authorized. Had to add Bash(zcli jira create:*) to settings before tickets could land.
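On the force-push item above: a coordinated rebase-after-merge could be as small as the sketch below. It is a sketch under assumptions, not what the subagents ran. restack_after_merge and its arguments are hypothetical, and it is meant to be run inside the child branch's own worktree once the parent PR has merged.

```shell
# Illustrative helper: replay only the child's own commits onto the updated base,
# then update the remote branch without clobbering pushes you have not seen.
# old_parent is the parent branch (or commit) the child was originally stacked on.
restack_after_merge() {
  local child=$1 old_parent=$2 base=${3:-origin/main}
  git fetch origin &&
    git rebase --onto "$base" "$old_parent" "$child" &&
    git push --force-with-lease origin "$child"
}
```

Using --force-with-lease instead of a plain force-push means the push is rejected if someone else, or another agent, updated the branch in the meantime.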

The takeaway from doing this twice in one session: multi-PR subagent fan-out scales when each subagent gets the full context that wraps its work — sibling PRs, deferral rules, repo conventions, verification command. Subagents without that context produce isolated work that ignores the broader rollout. The orchestrator's job is briefing, not implementing. I captured this as a global skill so the next rollout doesn't redo the design.

What I'd Do Differently

The handoff doc I wrote at session end is the artifact I most underinvested in early. Halfway through the second wave, I realized I wanted a single document a fresh reviewer could read to validate the whole series — what was planned, what was deferred and why, what was still in flight, which assumptions to re-check. I wrote it then. Next time it's the first artifact, not the last.

The second thing: the half-wired partial state from PRs #147 and #149 was discoverable in five minutes with git merge-base --is-ancestor. It went undiscovered for months because nobody ran that check. The check is now part of the orchestration skill — every multi-PR rollout starts with verifying the assumed baseline against origin/main, not against the prior PR descriptions.
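As a gate at the start of a rollout, that check might be wrapped like this. The skill itself is not shown in this post, so verify_baseline is a hypothetical stand-in; it relies only on the documented exit codes of git merge-base --is-ancestor (0 for ancestor, 1 for not an ancestor, anything else for an error).

```shell
# Illustrative helper: fail if any SHA a prior PR claims as merged is not on the base ref.
verify_baseline() {
  local base=$1; shift
  local sha rc missing=0
  for sha in "$@"; do
    rc=0
    git merge-base --is-ancestor "$sha" "$base" 2>/dev/null || rc=$?
    case $rc in
      0) echo "$sha: on $base" ;;
      1) echo "$sha: NOT on $base"; missing=1 ;;
      *) echo "$sha: error (exit $rc)"; missing=1 ;;
    esac
  done
  return "$missing"
}
```

Returning non-zero when anything is missing or unresolvable lets the orchestrator stop before planning on top of a baseline that does not exist.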

The PR descriptions are a story about what was supposed to happen. The merge-base check shows what actually did.