Chapter 6: Subagents and orchestration (and when not to)
Split the PR review across parallel subagents, and learn the failure modes that make a single agent the better choice.
Xiaoman · The Hall of Doubles
There is more work than one Xiaoman can carry. Today it learns to send out doubles.
Draft chapter. First cut to prove the format; it will be hardened before it is indexed.
What you’ll build
You will take the single pr-reviewer from Part 1 and split it into an orchestrator plus three subagents: a security lane, a performance lane, and a tests lane. The orchestrator slices a large diff, fans the slices out, enforces a token budget per lane, and merges three partial reviews into one deduplicated report. Then you will run the exact same PR through the single agent and the multi-agent version, line up the cost and the quality side by side, and apply a written decision rule for when fan-out is worth it. The deliverable is not “a multi-agent system.” It is the judgment to know, for a given PR, whether you need one.
Microsoft’s multi-agent lesson lists a few real advantages: specialization (each agent does one thing instead of becoming a confused generalist), scalability (adding a lane is easier than overloading one prompt), and fault tolerance (one lane can fail without sinking the run). But every one of those advantages has a matching cost, and most PRs do not earn the trade.
Prerequisites
- Chapter 5 finished: a composable pr-reviewer skill with a defined scope and least-privilege permissions.
- A PR large enough that distinct concerns genuinely separate (think 800+ changed lines touching auth, a hot loop, and a test suite at once). On a 30-line PR there is nothing to fan out.
- Per-run token observability so you can add up cost across lanes. See the usage-logging section of your provider’s official docs.
Steps
1. Decide first: does this PR even need fan-out?
Before writing any orchestration, run the decision rule below against the PR. Fan-out pays off only when work is genuinely independent and large enough that the coordination tax is small relative to the parallel gain. If two lanes would read the same files and reason about the same lines, they are not independent, and you will pay double for two overlapping opinions.
USE FAN-OUT only if ALL hold:
- diff_size > ~600 changed lines, OR > ~8 files across distinct concerns
- lanes touch mostly disjoint files (security != perf != tests)
- a single-agent run already hits context limits or times out
- per-lane findings do NOT depend on another lane's findings
OTHERWISE use one agent. One agent is the default, not the fallback.
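The rule above is simple enough to encode as a predicate, which makes it harder to skip. The sketch below is one possible reading of it, not a fixed implementation: the PRStats class and its field names are hypothetical, and how you measure each field (for example, what counts as "mostly disjoint") is left to your tooling.

```python
from dataclasses import dataclass


@dataclass
class PRStats:
    """Hypothetical summary of a PR; fill these in however suits your tooling."""
    changed_lines: int
    files_across_concerns: int      # files spread over distinct concerns
    lanes_mostly_disjoint: bool     # security, perf, and tests touch different files
    single_agent_struggles: bool    # a single-agent run hit context limits or timed out
    lanes_independent: bool         # no lane needs another lane's findings


def should_fan_out(pr: PRStats) -> bool:
    """Return True only when every condition of the decision rule holds."""
    big_enough = pr.changed_lines > 600 or pr.files_across_concerns > 8
    return (
        big_enough
        and pr.lanes_mostly_disjoint
        and pr.single_agent_struggles
        and pr.lanes_independent
    )


small_pr = PRStats(30, 2, True, False, True)
large_pr = PRStats(850, 12, True, True, True)
print(should_fan_out(small_pr))  # False
print(should_fan_out(large_pr))  # True
```

Note that a large PR still fails the check if the single agent is coping fine, which matches the rule that one agent is the default.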
2. Define one narrow job per subagent
Each subagent gets a single responsibility, the smallest slice of the diff it needs, and nothing else. Overlap is the thing to avoid: if the security lane and the tests lane both inspect the same file, you pay twice and then have to reconcile two opinions at merge time. Write the lane prompts so each one is told explicitly what to ignore.
# security lane
ROLE: Review ONLY for security defects in the provided diff slice.
SCOPE: files = [auth/*, api/handlers/*]
LOOK FOR: injection, authz bypass, secret leakage, unsafe deserialization.
IGNORE: style, performance, test coverage. Another agent owns those.
OUTPUT: JSON list {file, line, severity, finding, suggested_fix}
# perf lane
ROLE: Review ONLY for performance defects.
SCOPE: files = [core/engine/*, db/queries/*]
LOOK FOR: N+1 queries, unbounded loops, blocking I/O on hot paths.
IGNORE: security, style.
OUTPUT: same JSON schema.
# tests lane
ROLE: Review ONLY whether changes are adequately tested.
SCOPE: files = [**/*_test.*, src/** that lacks matching tests]
LOOK FOR: untested branches, missing edge cases, deleted assertions.
OUTPUT: same JSON schema.
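Before dispatching anything, it can help to check that the lane scopes really do split the changed files without overlap. The following is a minimal sketch under stated assumptions: it uses the glob patterns from the prompts above, gives each file to the first lane that matches (tests first, so a test file under auth/ is not also handed to security), and does not model the tests lane's "src/** that lacks matching tests" clause. The function name assign_lanes and the sample file list are illustrative only.

```python
from fnmatch import fnmatch

# Lane scopes copied from the prompts above. Order matters: the first matching
# lane wins, so test files are claimed by the tests lane before anything else.
LANE_SCOPES = [
    ("tests", ["*_test.*"]),
    ("security", ["auth/*", "api/handlers/*"]),
    ("perf", ["core/engine/*", "db/queries/*"]),
]


def assign_lanes(changed_files):
    """Give each file to at most one lane; return (lanes, unassigned)."""
    lanes = {name: [] for name, _ in LANE_SCOPES}
    unassigned = []
    for path in changed_files:
        for name, patterns in LANE_SCOPES:
            if any(fnmatch(path, pattern) for pattern in patterns):
                lanes[name].append(path)
                break
        else:
            # No lane claimed this file; decide by hand who, if anyone, reviews it.
            unassigned.append(path)
    return lanes, unassigned


files = [
    "auth/login.py",
    "auth/login_test.py",
    "db/queries/orders.py",
    "docs/README.md",
]
lanes, unassigned = assign_lanes(files)
print(lanes)
print(unassigned)
```

A long unassigned list, or one lane holding nearly every file, is an early sign that the PR does not separate cleanly and a single agent may be the better fit.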
3. Define each lane as a subagent and let the lead agent fan out
In the Claude Agent SDK you do not write a concurrent scheduler yourself. You declare each lane as an AgentDefinition and attach it via options.agents; then you write a lead (orchestrator) system prompt whose only tool is Task. Given a large diff, the lead agent uses Task to dispatch the lanes in parallel, each subagent running in its own isolated context window with only the files in its declared scope. The description field matters: it is what the lead uses to decide when to dispatch which lane. Use max_turns as a guardrail so a confused run stops at the cap. The listing below sets it once on the top-level options rather than per lane, so treat it as a cap on the run and confirm in the SDK docs how it applies to subagents.
# Python
import anyio
from claude_agent_sdk import (
    query, ClaudeAgentOptions, AgentDefinition,
    AssistantMessage, TextBlock, ResultMessage,
)

# One AgentDefinition per lane; the description tells the lead when to dispatch it.
LANES = {
    "security-lane": AgentDefinition(
        description="Use to review a diff for security defects. Touches only auth and api/handlers files.",
        prompt="Review ONLY for security defects: injection, authz bypass, secret "
               "leakage, unsafe deserialization. Ignore style, performance, tests. "
               "Output a JSON list {file, line, severity, finding, suggested_fix}.",
        tools=["Read", "Grep"], model="sonnet",
    ),
    "perf-lane": AgentDefinition(
        description="Use to review a diff for performance defects. Touches only core/engine and db/queries files.",
        prompt="Review ONLY for performance defects: N+1 queries, unbounded loops, "
               "blocking I/O on hot paths. Ignore security, style. Same JSON schema.",
        tools=["Read", "Grep"], model="sonnet",
    ),
    "tests-lane": AgentDefinition(
        description="Use to review whether changes are adequately tested. Touches test files and src lacking tests.",
        prompt="Review ONLY test coverage: untested branches, missing edge cases, "
               "deleted assertions. Same JSON schema.",
        tools=["Read", "Grep"], model="sonnet",
    ),
}

LEAD = """You are a PR-review orchestrator. Your only tool is Task.
You NEVER review code yourself. Partition the diff by file path into disjoint
lanes, dispatch security-lane / perf-lane / tests-lane in parallel, then collect
their JSON findings verbatim. Never let two lanes read the same files."""

options = ClaudeAgentOptions(
    system_prompt=LEAD,
    agents=LANES,
    allowed_tools=["Task"],  # the lead can only dispatch; it cannot read or write
    max_turns=8,
    cwd="./repo",
)


async def main():
    async for message in query(
        prompt="Review the large diff under src/payments/, fanning out by lane.",
        options=options,
    ):
        if isinstance(message, AssistantMessage):
            # Print the lead's text output as it streams in.
            for block in message.content:
                if isinstance(block, TextBlock):
                    print(block.text)
        elif isinstance(message, ResultMessage):
            # The final message carries usage and cost for the whole run.
            print(f"usage: {message.usage} cost: ${message.total_cost_usd or 0:.4f}")


anyio.run(main)
// TypeScript
import { query } from "@anthropic-ai/claude-agent-sdk";

// One definition per lane; the description tells the lead when to dispatch it.
const LANES = {
  "security-lane": {
    description: "Use to review a diff for security defects. Touches only auth and api/handlers files.",
    prompt:
      "Review ONLY for security defects: injection, authz bypass, secret " +
      "leakage, unsafe deserialization. Ignore style, performance, tests. " +
      "Output a JSON list {file, line, severity, finding, suggested_fix}.",
    tools: ["Read", "Grep"],
    model: "sonnet",
  },
  "perf-lane": {
    description: "Use to review a diff for performance defects. Touches only core/engine and db/queries files.",
    prompt:
      "Review ONLY for performance defects: N+1 queries, unbounded loops, " +
      "blocking I/O on hot paths. Ignore security, style. Same JSON schema.",
    tools: ["Read", "Grep"],
    model: "sonnet",
  },
  "tests-lane": {
    description: "Use to review whether changes are adequately tested. Touches test files and src lacking tests.",
    prompt:
      "Review ONLY test coverage: untested branches, missing edge cases, " +
      "deleted assertions. Same JSON schema.",
    tools: ["Read", "Grep"],
    model: "sonnet",
  },
};

const LEAD = `You are a PR-review orchestrator. Your only tool is Task.
You NEVER review code yourself. Partition the diff by file path into disjoint
lanes, dispatch security-lane / perf-lane / tests-lane in parallel, then collect
their JSON findings verbatim. Never let two lanes read the same files.`;

const q = query({
  prompt: "Review the large diff under src/payments/, fanning out by lane.",
  options: {
    systemPrompt: LEAD,
    agents: LANES,
    allowedTools: ["Task"], // the lead can only dispatch; it cannot read or write
    maxTurns: 8,
    cwd: "./repo",
  },
});

for await (const message of q) {
  if (message.type === "assistant") {
    // Print the lead's text output as it streams in.
    for (const block of message.message.content) {
      if (block.type === "text") console.log(block.text);
    }
  } else if (message.type === "result") {
    // The final message carries usage and cost for the whole run.
    console.log("usage and cost:", message.usage, message.total_cost_usd);
  }
}
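The lanes are asked for a JSON list, but nothing guarantees a model actually returns one, so it is worth validating each lane's output before it reaches the merge. The sketch below assumes you have already isolated one lane's raw JSON text as a string (how you extract it from the lead's messages is not shown here). The function name parse_findings and the added "lane" key are illustrative choices, not part of the SDK.

```python
import json

# The keys every finding must carry, taken from the lane prompts' output schema.
REQUIRED_KEYS = {"file", "line", "severity", "finding", "suggested_fix"}


def parse_findings(raw_text, lane):
    """Parse one lane's JSON output; return (valid_findings, problems)."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return [], [f"{lane}: output is not valid JSON ({exc.msg})"]
    if not isinstance(data, list):
        return [], [f"{lane}: expected a JSON list, got {type(data).__name__}"]

    valid, problems = [], []
    for index, item in enumerate(data):
        if not isinstance(item, dict):
            problems.append(f"{lane}[{index}]: not an object")
            continue
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            problems.append(f"{lane}[{index}]: missing {sorted(missing)}")
            continue
        valid.append({**item, "lane": lane})  # remember which lane reported it
    return valid, problems


raw = (
    '[{"file": "auth/login.py", "line": 42, "severity": "high", '
    '"finding": "SQL built by string concatenation", '
    '"suggested_fix": "use a parameterized query"}, '
    '{"file": "auth/x.py"}]'
)
valid, problems = parse_findings(raw, "security-lane")
print(len(valid), problems)
```

Collecting problems instead of raising means one malformed finding does not discard the rest of a lane's work.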
4. Merge deliberately: deduplicate and reconcile conflicts
The merge step is where multi-agent systems tend to break down. Subagents never saw each other’s reasoning, so they produce duplicates (two lanes both flag the same line for different reasons) and sometimes contradictions (perf wants caching, security flags the cache as a data-leak vector). This step is not an SDK call; it is plain code you run over the structured findings the lanes returned. Because lanes emit JSON rather than prose, the merge can be mostly mechanical, with the model called only for genuine conflicts. The function below takes the collected findings list, deduplicates, reconciles, and sorts by severity.
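# Python
from collections import defaultdict

# Assumed severity labels; adjust to whatever your lanes actually emit.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}


def severity_rank(finding):
    return SEVERITY_RANK.get(finding["severity"], -1)


def merge(findings):
    # findings: dicts parsed from the lanes' JSON output.
    # conflicting_fixes and resolve_conflict are helpers you supply.
    by_loc = defaultdict(list)
    for f in findings:
        by_loc[(f["file"], f["line"])].append(f)
    report = []
    for loc, group in by_loc.items():
        if len(group) == 1:
            report.append(group[0])
        elif conflicting_fixes(group):
            # same line, conflicting recommendations: escalate to the orchestrator model
            report.append(resolve_conflict(loc, group))  # 1 small LLM call
        else:
            # same line, compatible findings: keep the highest severity
            report.append(max(group, key=severity_rank))
    # Sort by rank, not by the raw label, so "high" does not sort below "low".
    return sorted(report, key=severity_rank, reverse=True)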
// JavaScript
function merge(findings) {
  const byLoc = groupBy(findings, (f) => `${f.file}:${f.line}`);
  const report = [];
  for (const group of Object.values(byLoc)) {
    if (group.length === 1) {
      report.push(group[0]);
    } else {
      // same line, multiple lanes: keep highest severity,
      // and if recommendations conflict, escalate to the orchestrator model
      if (conflictingFixes(group)) {
        report.push(resolveConflict(group)); // 1 small LLM call
      } else {
        report.push(highestSeverity(group));
      }
    }
  }
  return report.sort((a, b) => severityRank(b) - severityRank(a));
}
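The merge functions above lean on helpers that the chapter leaves undefined. As a rough illustration of what they might look like, here is one hypothetical take on conflicting_fixes, plus a non-LLM stand-in for conflict resolution that keeps both suggestions and flags them for a human. Both are assumptions for the sake of example: the string comparison is deliberately crude, and the severity labels in SEVERITY_RANK are assumed rather than taken from the lanes' schema.

```python
# Assumed severity labels; the lane prompts do not pin these down.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}


def conflicting_fixes(group):
    """Treat a group as conflicting when its suggested fixes are not all identical.

    Crude stand-in: fixes that agree in substance but differ in wording
    are flagged as conflicts too.
    """
    fixes = {f["suggested_fix"].strip().lower() for f in group}
    return len(fixes) > 1


def flag_for_review(loc, group):
    """Hypothetical fallback when no model call is made: keep every suggestion."""
    file, line = loc
    worst = max(group, key=lambda f: SEVERITY_RANK.get(f["severity"], -1))
    return {
        "file": file,
        "line": line,
        "severity": worst["severity"],  # inherit the most severe label
        "finding": " / ".join(f["finding"] for f in group),
        "suggested_fix": "CONFLICT: " + " vs. ".join(f["suggested_fix"] for f in group),
    }


group = [
    {"file": "api/handlers/user.py", "line": 10, "severity": "medium",
     "finding": "repeated lookup on hot path",
     "suggested_fix": "cache the user record"},
    {"file": "api/handlers/user.py", "line": 10, "severity": "high",
     "finding": "cached record may leak across tenants",
     "suggested_fix": "do not cache per-user data"},
]
print(conflicting_fixes(group))
merged = flag_for_review(("api/handlers/user.py", 10), group)
print(merged["severity"], "|", merged["suggested_fix"])
```

The sample mirrors the caching example from the text: perf wants a cache, security objects to it, and neither lane saw the other's reasoning.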
5. Compare against the single agent on the same PR
Run the Chapter 5 single-agent reviewer on the identical PR. Capture total tokens, wall-clock latency, and the finding set for both. Now you can answer the only question that matters: did fan-out find real defects the single agent missed, or did it just spend three to four times the tokens to produce a noisier version of the same review?
                 single agent    multi-agent (3 lanes + merge)
tokens           ~38k            ~140k (~3.7x)
latency          ~22s            ~31s (parallel, but merge adds a hop)
true findings    9               11 (+2 real, both in the perf lane)
duplicates       0               4 (reconciled at merge)

verdict: fan-out was worth it ONLY because the perf lane went deep
on a hot path the single agent skimmed. On a smaller PR, no.
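To make the trade concrete, it helps to work out what each extra finding cost. The arithmetic below uses the approximate figures from the comparison above (about 38k versus 140k tokens, 9 versus 11 true findings); the function name compare_runs is illustrative.

```python
def compare_runs(single, multi):
    """Return the token ratio and the extra tokens paid per extra true finding."""
    token_ratio = multi["tokens"] / single["tokens"]
    extra_tokens = multi["tokens"] - single["tokens"]
    extra_findings = multi["true_findings"] - single["true_findings"]
    # No extra findings means the extra spend bought nothing measurable.
    per_extra = extra_tokens / extra_findings if extra_findings > 0 else None
    return token_ratio, per_extra


single = {"tokens": 38_000, "true_findings": 9}
multi = {"tokens": 140_000, "true_findings": 11}
ratio, per_extra = compare_runs(single, multi)
print(f"token ratio: {ratio:.1f}x")                   # token ratio: 3.7x
print(f"tokens per extra finding: {per_extra:,.0f}")  # tokens per extra finding: 51,000
```

At roughly 51,000 extra tokens per additional finding, against about 4,200 tokens per finding for the single agent, the split only makes sense when those two findings are ones you could not afford to miss.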
Learned: send a copy, or hold back
Xiaoman can slice a big diff into disjoint lanes and fan them out to several subagents at once. The harder part: it now runs the decision rule first to ask whether this PR even earns the split, and most of the time the answer is no.
How to verify
- Confirm each subagent stayed in its lane: grep its output for findings outside its declared scope. Any leakage means your IGNORE instructions are too weak.
- Sum tokens across all lanes plus the orchestrator’s merge call, and put that number next to the single-agent run. If the multi-agent run does not at least match the single agent’s findings, the extra spend is pure waste.
- Inspect the merged report for surviving duplicates or unreconciled contradictions. Either one means the merge step (step 4), not the lanes, is your weak link.
- Kill one lane on purpose (return an error). The run should still produce a partial report. If it crashes, you have not actually bought fault tolerance.
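The first check in the list above can be automated rather than done by eye. This is a minimal sketch, assuming each lane's findings are already parsed into dicts with a "file" key and that the scopes are the glob patterns from the lane prompts; the tests lane is left out because its scope is not a plain glob. The names LANE_SCOPES and out_of_scope are illustrative.

```python
from fnmatch import fnmatch

# Declared scopes from the lane prompts in step 2.
LANE_SCOPES = {
    "security-lane": ["auth/*", "api/handlers/*"],
    "perf-lane": ["core/engine/*", "db/queries/*"],
}


def out_of_scope(findings, lane):
    """Return the findings whose file falls outside the lane's declared scope."""
    patterns = LANE_SCOPES[lane]
    return [
        f for f in findings
        if not any(fnmatch(f["file"], pattern) for pattern in patterns)
    ]


security_findings = [
    {"file": "auth/session.py", "line": 12},
    {"file": "core/engine/loop.py", "line": 88},  # leaked into the perf lane's files
]
leaks = out_of_scope(security_findings, "security-lane")
print([f["file"] for f in leaks])  # ['core/engine/loop.py']
```

An empty list for every lane is what you are hoping to see; anything else points back at the lane prompt.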
Learned: counting the cost of a split
You can now put the multi-agent run's total tokens next to the single-agent run and kill a lane on purpose to see if it degrades to a partial report, which tells you whether fan-out really saved work or just spent three to four times as much for a noisier version of the same review.
Why it works
Each subagent runs in its own context window, so a lane that only reads the auth files never spends tokens reasoning about the test suite. That is what actually makes “specialization” work: not that the model is smarter per lane, but that each lane’s context is smaller, cleaner, and more relevant. The context-engineering lesson from Chapter 3 was about one agent; here the same lesson applies across several context windows at once. Fan-out also turns a sequential read of a huge diff into concurrent reads of slices, which is where the latency win comes from when slices are truly disjoint.
But none of this is free. The orchestrator and merge are pure overhead: tokens and a round-trip you would not pay with one agent. You also lose the cross-cutting reasoning a single agent gives you for free, the kind where noticing an auth change makes it look harder at a related query. Microsoft’s lesson says it plainly: multi-agent only pays off on “complex tasks” that “break down into specialized subtasks.” A medium PR is not that task.
Recap
Subagents help when work is genuinely independent, the diff is large enough that coordination overhead is a rounding error, and a single agent is already hitting context or latency walls. They hurt when concerns overlap (you pay double for redundant opinions), when context is lost at the merge (duplicates and contradictions slip through), or when token cost balloons past the quality gain. Default to one agent; reach for many only when the decision rule in step 1 says yes and the comparison in step 5 proves it. The next chapter rebuilds this same PR-reviewer in OpenAI Codex and pulls out what transfers across vendors.
Common pitfalls
- Multi-agent by default. The most common mistake is reaching for fan-out because it looks sophisticated. Most PRs do not need it; the coordination cost usually outweighs the parallel gain. One agent is the default, not the fallback.
- Overlapping lanes. If two subagents read the same files, you pay twice and inherit a reconciliation problem. Slice by disjoint paths and write explicit IGNORE instructions.
- Lost context at the merge. Subagents never share reasoning. Make them emit structured findings and reconcile conflicts deliberately, or duplicates and contradictions will leak into the report.
- Cost blowup with no budget. A confused lane can loop and drain your spend. Set a per-lane token cap, reserve headroom for the merge, and enforce a global ceiling.
- No fault tolerance, despite the diagram. If one failing lane crashes the whole run, you have the cost of multi-agent without its main resilience benefit. Let the orchestrator degrade to a partial report.
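The budget pitfall above names three limits: a per-lane cap, headroom reserved for the merge, and a global ceiling. One way to keep all three in a single place is a small tracker like the sketch below. It is an illustration under assumptions, not an SDK feature: the TokenBudget class is hypothetical, the numbers are made up, and you would feed charge() from whatever usage figures your provider reports.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a lane or the whole run goes over its token allowance."""


class TokenBudget:
    def __init__(self, per_lane_cap, global_ceiling, merge_reserve):
        self.per_lane_cap = per_lane_cap
        # Lanes may only spend what is left after setting aside the merge reserve.
        self.lane_pool = global_ceiling - merge_reserve
        self.spent = {}

    def charge(self, lane, tokens):
        """Record tokens used by a lane; raise if a cap would be crossed."""
        lane_total = self.spent.get(lane, 0) + tokens
        run_total = sum(self.spent.values()) + tokens
        if lane_total > self.per_lane_cap:
            raise BudgetExceeded(f"{lane} exceeded its cap of {self.per_lane_cap} tokens")
        if run_total > self.lane_pool:
            raise BudgetExceeded("lanes would eat into the merge reserve")
        self.spent[lane] = lane_total


budget = TokenBudget(per_lane_cap=50_000, global_ceiling=150_000, merge_reserve=10_000)
budget.charge("security-lane", 30_000)
budget.charge("perf-lane", 45_000)
try:
    budget.charge("perf-lane", 10_000)  # 55k total for this lane, over the 50k cap
except BudgetExceeded as exc:
    print(exc)  # perf-lane exceeded its cap of 50000 tokens
```

A rejected charge is not recorded, so the tracker still reflects what was actually spent when you decide whether to stop the lane or the whole run.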
Xiaoman sends a double to run an errand for the first time, and for the first time misjudges whether it should, turning a simple task into a tangled one. It learns the harder lesson: when not to split. The Hall of Doubles lights up.