Chapter 12: Cost and latency
Treat cost and latency as first-class design constraints in the PR-reviewer agent using model tiering, prompt caching, batching, and early termination.
Xiaoman · The Counting House
Xiaoman is dependable now, but it grows pricey and sluggish. Today you teach it to live within its means.
Draft chapter. First cut to prove the format; it will be hardened before it is indexed.
What you’ll build
A version of the PR-reviewer agent that does the same job for less money and in less time. By the end you will have a router that picks a model per step, a cached prompt prefix, a batched fan-out over changed files, and a stop condition that ends the run the moment a verdict is ready. You will also have a cost-accounting sketch that prints what each review actually spent, so the next optimization is driven by numbers and not by hunches.
Here’s the one thing to keep straight. Every model call has two costs: the money (input plus output tokens, priced per model) and the time (the round trip, which dominates user-perceived latency). A quick-and-dirty agent treats both as free and pays for it once you scale up. A production agent treats them as a budget it has to spend carefully.
Prerequisites
- The working agent loop with token and timing logs from earlier chapters. If you cannot already see tokens-in, tokens-out, and wall-clock per call, build that first.
- The provider pricing page open in a tab. Treat prices as inputs you look up, not constants you memorize. Models and rates change; see the official pricing page for current numbers.
- A fixed evaluation set: ten to twenty real PRs with known outcomes. This is the ruler you measure every change against.
Steps
1. Establish a baseline
You can’t improve what you haven’t measured. Run the agent over your fixed PR set and record four numbers per review: input tokens, output tokens, wall-clock seconds, and number of model calls (tool-call rounds). Sum them and divide to get per-review averages. Keep this in a small table you can re-run after every change. The Microsoft production lesson makes the same point: token cost per run and latency are the numbers that let you see inside the agent instead of treating it as a black box.
import time
import anyio
from claude_agent_sdk import query, ClaudeAgentOptions, ResultMessage
# Cost-accounting sketch. You do not price tokens yourself: the ResultMessage
# hands you this run's total cost, turn count, and token usage (check the
# official docs for the exact usage field names).
async def review_with_cost(pr_id, diff, options):
start = time.monotonic()
async for message in query(prompt=f"Review this diff:\n{diff}", options=options):
if isinstance(message, ResultMessage): # terminal: cost and usage are here
elapsed = time.monotonic() - start
print(f"review={pr_id} turns={message.num_turns} "
f"cost=${message.total_cost_usd:.4f} secs={elapsed:.1f} "
f"usage={message.usage}")
anyio.run(review_with_cost, pr_id, diff, options)
import { query } from "@anthropic-ai/claude-agent-sdk";
// Cost-accounting sketch. You do not price tokens yourself: the result message
// hands you this run's total cost, turn count, and token usage.
async function reviewWithCost(prId, diff, options) {
const start = Date.now();
for await (const message of query({ prompt: `Review this diff:\n${diff}`, options })) {
if (message.type === "result" && message.subtype === "success") {
const secs = (Date.now() - start) / 1000;
console.log(`review=${prId} turns=${message.num_turns} ` +
`cost=$${message.total_cost_usd.toFixed(4)} secs=${secs.toFixed(1)}`,
message.usage);
}
}
}
2. Tier your models
Not every step needs the strongest model. Use a strong model for the final review judgment, where missing a real bug is expensive, and a cheaper, faster model for mechanical sub-tasks: classifying which files changed, summarizing a diff hunk, deciding whether a file is even worth reading. This is the “router model” pattern from the production lesson: route by complexity, reserve the large model for genuine reasoning. The savings compound because the cheap-model steps usually run many times per review while the strong-model step runs once.
from claude_agent_sdk import ClaudeAgentOptions
# Routing table: which model handles which step.
# model accepts the aliases "haiku" / "sonnet" / "opus".
MODEL_FOR = {
"triage_changed_files": "haiku", # cheap, runs once
"summarize_hunk": "haiku", # cheap, runs per file
"is_file_worth_reading": "haiku", # cheap, runs per file
"final_review_verdict": "opus", # strong, runs once
"security_deep_dive": "opus", # strong, only if triage flags risk
}
def options_for(step):
return ClaudeAgentOptions(model=MODEL_FOR[step], allowed_tools=["Read"])
// Routing table: which model handles which step.
// model accepts the aliases "haiku" / "sonnet" / "opus".
const MODEL_FOR = {
triage_changed_files: "haiku", // cheap, runs once
summarize_hunk: "haiku", // cheap, runs per file
is_file_worth_reading: "haiku", // cheap, runs per file
final_review_verdict: "opus", // strong, runs once
security_deep_dive: "opus", // strong, only if triage flags risk
};
function optionsFor(step) {
return { model: MODEL_FOR[step], allowedTools: ["Read"] };
}
The tradeoff: a cheap model on triage can misjudge a file as low-risk and skip it. Guard against this by making triage err toward escalation (when unsure, send it to the strong model) and by keeping the strong-model security pass on any file the cheap model flagged.
3. Cache the stable prefix
Your system prompt, tool schemas, and review rubric are identical on every call. Without caching you pay full input price to re-process them on every turn of every review. Prompt caching lets you pay a one-time write, then read the same prefix back at a deep discount on later calls within the cache lifetime. With the Agent SDK you do not mark a cache_control breakpoint by hand: put the stable content in system_prompt, and the CLI caches that prefix for you, while the per-PR diff goes through prompt.
There is just one rule: stable content in system_prompt, per-PR content in prompt. Mix volatile content into system_prompt and the prefix stops matching, so you never get a hit. To confirm a hit, check whether the cache-read field in ResultMessage.usage is non-zero (see the docs for the exact field name).
from claude_agent_sdk import query, ClaudeAgentOptions, ResultMessage
options = ClaudeAgentOptions(
system_prompt=REVIEW_RUBRIC, # stable prefix: the CLI caches it
allowed_tools=["Read"],
model="opus",
)
async for message in query(
prompt=f"Review this diff:\n{pr_diff}", # changes every PR
options=options,
):
if isinstance(message, ResultMessage):
print(message.usage) # check the cache-read field is non-zero to confirm a hit
import { query } from "@anthropic-ai/claude-agent-sdk";
const options = {
systemPrompt: REVIEW_RUBRIC, // stable prefix: the CLI caches it
allowedTools: ["Read"],
model: "opus",
};
for await (const message of query({
prompt: `Review this diff:\n${prDiff}`, // changes every PR
options,
})) {
if (message.type === "result" && message.subtype === "success") {
console.log(message.usage); // check the cache-read field is non-zero to confirm a hit
}
}
Two operational notes. The default cache lifetime is short (5 minutes, refreshed on each hit); a "ttl": "1h" option exists at a higher write cost for prefixes reused less often than every few minutes. And there is a minimum cacheable length (on the order of ~1k tokens for current mid and large models): below it, caching is silently skipped. Verify a hit by checking that cache_read_input_tokens is non-zero. See the prompt-caching docs for exact thresholds and fields.
4. Batch independent work
If you review ten files with no dependencies between them, do not do ten sequential round trips. Latency on sequential calls is the sum of every round trip; on concurrent calls it is bounded by the slowest one. Issue the per-file summaries and risk checks as a concurrent fan-out, then gather them for the single final verdict. For a multi-file PR, this does more for wall-clock time than anything else here.
import asyncio
from claude_agent_sdk import query, ClaudeAgentOptions, AssistantMessage, TextBlock
async def summarize_and_flag(f):
opts = ClaudeAgentOptions(model="haiku", allowed_tools=["Read"])
out = ""
async for message in query(prompt=f"Summarize and flag risk in {f}", options=opts):
if isinstance(message, AssistantMessage):
for block in message.content:
if isinstance(block, TextBlock):
out += block.text
return out
async def review_pr(files):
# fan out the cheap, independent per-file work concurrently: one query() per file
findings = await asyncio.gather(*[summarize_and_flag(f) for f in files]) # haiku, parallel
# one strong-model call sees all the gathered context
return await final_verdict(findings) # opus, once
import { query } from "@anthropic-ai/claude-agent-sdk";
async function summarizeAndFlag(f) {
const opts = { model: "haiku", allowedTools: ["Read"] };
let out = "";
for await (const message of query({ prompt: `Summarize and flag risk in ${f}`, options: opts })) {
if (message.type === "assistant") {
for (const block of message.message.content) {
if (block.type === "text") out += block.text;
}
}
}
return out;
}
async function reviewPr(files) {
// fan out the cheap, independent per-file work concurrently: one query() per file
const findings = await Promise.all(files.map(summarizeAndFlag)); // haiku, parallel
// one strong-model call sees all the gathered context
return finalVerdict(findings); // opus, once
}
Mind the rate limits while you fan out: a bounded concurrency (a semaphore around the gather) keeps you from tripping the provider’s per-minute caps. We tighten this in the deployment chapter.
5. Terminate early and cap rounds
The moment the agent has enough to render its verdict, stop. Two failure modes waste budget here: the agent keeps reading files it does not need, and the agent loops (re-reading, re-thinking) without converging. Set an explicit stop condition and a hard cap on tool-call rounds. The production lesson calls infinite loops out by name; a clear termination condition is the fix.
from claude_agent_sdk import query, ClaudeAgentOptions, ResultMessage
options = ClaudeAgentOptions(
system_prompt=REVIEW_RUBRIC,
allowed_tools=["Read"],
max_turns=6, # hard round cap: a looping run stops at the ceiling
max_budget_usd=0.50, # budget cap: overshooting returns error_max_budget_usd
)
async for message in query(prompt=f"Review this diff:\n{pr_diff}", options=options):
if isinstance(message, ResultMessage):
if message.subtype == "error_max_budget_usd":
# hit the cap without converging: flag for human
handle_capped(message, reason="budget cap")
# done: it wraps up on its own once it has a verdict, no padding
import { query } from "@anthropic-ai/claude-agent-sdk";
const options = {
systemPrompt: REVIEW_RUBRIC,
allowedTools: ["Read"],
maxTurns: 6, // hard round cap: a looping run stops at the ceiling
maxBudgetUsd: 0.50, // budget cap: overshooting returns error_max_budget_usd
};
for await (const message of query({ prompt: `Review this diff:\n${prDiff}`, options })) {
if (message.type === "result") {
if (message.subtype === "error_max_budget_usd") {
// hit the cap without converging: flag for human
handleCapped(message, "budget cap");
}
// done: it wraps up on its own once it has a verdict, no padding
}
}
Learned: spending wiselyXiaoman no longer reaches for the top model on every step. It routes the mechanical work to a cheap model, caches the stable prefix, fans the independent files out in parallel, stops the moment it has a verdict, and knows what each review cost in tokens and seconds.
How to verify
Re-run the baseline set after each change and compare the table. Confirm that per-review tokens and wall-clock drop, and that the call count for the strong model stays at roughly one. Critically, confirm review quality on a held-out PR set does not regress: a cheaper agent that misses real bugs is not cheaper, it is broken. For caching specifically, assert cache_read_input_tokens > 0 on the second call of a review; if it is zero, your prefix is changing or is below the minimum length.
Learned: letting numbers decideYou can now confirm on a held-out PR set that the savings did not cost you quality. Re-run the baseline table after each cut and check whether tokens and latency actually dropped, the cache hit, and no real bug slipped through.
Why it works
The four techniques each hit a different part of cost and latency, and you can stack them without one getting in another’s way. Tiering cuts the price per step. Caching cuts the price of the repeated prefix. Batching cuts wall-clock without changing total tokens. Early termination cuts the number of steps. None of them trade quality for savings when applied as described, because each one removes work that did not contribute to the verdict in the first place.
Recap
Cost and latency are design constraints, not afterthoughts. Measure a baseline, route cheap work to a cheap model, cache the constant prefix, parallelize the independent steps, and stop the instant you have a verdict. Re-measure after every change, and let the numbers decide what to do next instead of going on a hunch.
Common pitfalls
- Caching content that changes every call (a timestamp or per-PR context in the cached block), so the hash never matches and you never get a hit.
- Using the top model for trivial classification because it was already wired in.
- Optimizing tokens while quietly degrading the review; always re-check the held-out set.
- Unbounded concurrent fan-out that trips the provider’s rate limit and turns a speed win into a wall of 429s.
- Hardcoding prices as facts in code or prose; rates change, so look them up on the pricing page.
Xiaoman learns to spend carefully, even volunteering, this task is not worth the big model, the small one will do. It starts to act like someone who runs a household. The Counting House lights up.
Just lit The Counting House · 13 / 16 lit
Sources
- Anthropic docs: prompt caching · official
- Microsoft: AI Agents in Production (cost management) · official