Chapter 3: Context engineering
Treat the context window as a budget and decide deliberately what goes in and what stays out.
Xiaoman · The Memory Maze
Xiaoman reads more and more, and its head is nearly full. Today it must learn what to keep and what to drop.
Draft chapter. First cut to prove the format; it will be hardened before it is indexed.
What you’ll build
A context strategy for the PR reviewer from Chapter 2. The loop you wrote keeps appending messages: the system prompt, the task, every tool call, and every tool result. Leave it alone and it grows without bound, and worse, growing this way makes the model worse at its job. In this chapter you build an assemble_context function that treats the window as a fixed budget, a compact step that summarizes old turns, and a small memory file the reviewer reads back after a reset. What you end up with is a reviewer that can get through a 40-file pull request without losing track of what it is doing.
The core idea comes straight from Anthropic’s docs: context is a finite resource, and the more you add, the less each token buys you. The model has an “attention budget” that gets spread thinner as the token count climbs. Adding more tokens is not free, and often it actively hurts.
Prerequisites
- Chapter 2 finished: a working minimal loop with a
read_filetool and a running message list. - A rough sense of your model’s context window size and the cost per input token. See the official docs for your model.
- Knowing that more tokens is not better. Relevant tokens are better. This chapter explains why.
Why it works
Transformers attend pairwise: every token can look at every other token, so the attention work scales with the square of the sequence length. As the prompt grows, that fixed attention budget gets split across more relationships, and the model gets worse at picking out the one line that matters. The phenomenon has a name now: “context rot,” the measured drop in retrieval and reasoning accuracy as input length rises. The practical takeaway is blunt: a 5,000-token prompt with exactly the right diff beats a 90,000-token prompt that buries the same diff under forty unrelated files. Your job is not to fill the window. It is to find the smallest set of high-signal tokens that lets the model get the next step right.
Steps
1. Name the budget and split it
Pick a working ceiling well below the hard context limit, then cut it into named slices. Leave room for the model’s reply (output tokens count too) and a safety margin so one large tool result cannot blow past the request limit. Writing the budget down turns a vague “don’t use too much” into a number you can assert against in tests.
Context budget for the PR reviewer (illustrative, 200k window)
+-----------------------------+----------+-----------------------------------+
| Slice | Budget | Notes |
+-----------------------------+----------+-----------------------------------+
| System prompt + tool specs | 2,000 | Stable, written once |
| Task + PR metadata | 1,000 | Pinned, survives compaction |
| Memory / findings-so-far | 2,000 | Rolling notes, re-read on reset |
| Diff under review | 20,000 | Retrieved in slices, not whole |
| Retrieved file context | 15,000 | Only files the diff touches |
| Recent tool-call transcript | 40,000 | Last N turns verbatim |
| -- compaction threshold -- | 80,000 | Compact when transcript crosses it|
| Reserved for model output | 16,000 | Never spend this |
+-----------------------------+----------+-----------------------------------+
2. Tier the content and fill top-down
Sort everything that could go in the prompt into three tiers. Must-have: the task statement and the diff hunks under review. Useful: the definitions and call sites the diff touches. Optional: full file history, unrelated modules, the entire repo. Fill from the top tier until the budget runs out, then stop. The optional tier rarely earns the tokens it costs.
def assemble_context(task, diff, memory, transcript, budget):
parts = []
parts.append(SYSTEM_PROMPT) # tier 0: always
parts.append(render_task(task)) # tier 1: pinned, must-have
parts.append(render_memory(memory)) # tier 1: durable findings
parts.append(render_diff(diff)) # tier 1: the thing under review
# tier 2: pull in only files the diff references, until budget runs out
for f in files_referenced_by(diff):
chunk = read_relevant_slice(f, diff)
if tokens(parts) + tokens(chunk) > budget.context_ceiling:
break
parts.append(chunk)
# tier 3 (history, neighbors) is intentionally omitted unless asked for
parts.extend(recent_turns(transcript, budget.transcript))
return join(parts)
3. Fetch on demand, do not dump everything in
There are two ways to get a file in front of the model: stuff it into the prompt up front, or hand the model a lightweight identifier (a path) and let it pull the slice it needs with a tool. Prefer the second. It is how a human reviewer works: you do not read the whole repo, you open the file the diff mentions and jump to the changed function. Pre-loading is fine for small, stable references (a style guide, the PR title); fetching on demand wins for anything large or rarely needed. Claude Code itself works this way: it loads a small CLAUDE.md up front and uses grep/glob at runtime for everything else.
import anyio
from claude_agent_sdk import query, ClaudeAgentOptions
# Anti-pattern: read every changed file in full and stuff it into the prompt,
# burying the signal and burning the budget
files = ["src/payments/refund.py", "src/payments/gateway.py", "..."] # all 40
dump = "\n\n".join(open(p).read() for p in files)
async def main():
async for message in query(
prompt=f"Review these files for bugs and risks:\n{dump}", # dumped up front
options=ClaudeAgentOptions(allowed_tools=[]), # no retrieval left
):
handle(message)
anyio.run(main)
# Pattern: give only the path list, grant Read/Grep/Glob, let the model fetch
# the slices it needs on demand
async def main():
file_list = ", ".join(["src/payments/refund.py", "src/payments/gateway.py", "..."])
async for message in query(
prompt=f"Review this PR for bugs and risks. Files changed: {file_list}. "
f"Use Read to pull only the slices you need; do not read every file at once.",
options=ClaudeAgentOptions(
allowed_tools=["Read", "Grep", "Glob"], # just-in-time retrieval
cwd="./repo",
),
):
handle(message)
anyio.run(main)
import { query } from "@anthropic-ai/claude-agent-sdk";
// Anti-pattern: read every changed file in full and stuff it into the prompt,
// burying the signal and burning the budget
const files = ["src/payments/refund.py", "src/payments/gateway.py", "..."]; // all 40
const dump = files.map((p) => readFileSync(p, "utf8")).join("\n\n");
for await (const message of query({
prompt: `Review these files for bugs and risks:\n${dump}`, // dumped up front
options: { allowedTools: [] }, // no retrieval left
})) {
handle(message);
}
// Pattern: give only the path list, grant Read/Grep/Glob, let the model fetch
// the slices it needs on demand
const fileList = ["src/payments/refund.py", "src/payments/gateway.py", "..."].join(", ");
for await (const message of query({
prompt: `Review this PR for bugs and risks. Files changed: ${fileList}. ` +
`Use Read to pull only the slices you need; do not read every file at once.`,
options: {
allowedTools: ["Read", "Grep", "Glob"], // just-in-time retrieval
cwd: "./repo",
},
})) {
handle(message);
}
4. Compact at a threshold, not when it overflows
When the running transcript crosses the compaction threshold (80k in the table, not the 200k ceiling), summarize the oldest turns into a compact note and replace them in place. Prioritize recall first: keep architectural decisions, confirmed bugs, and open questions; drop verbose tool output that has already been acted on. Keep the most recent turns verbatim, because that is where the model is actively working.
def maybe_compact(messages, budget):
if tokens(messages) < budget.compact_threshold:
return messages
old, recent = split_keeping_last(messages, n=budget.keep_recent_turns)
summary = model.summarize(
old,
keep=["decisions made", "bugs confirmed", "files reviewed", "open questions"],
drop=["raw file dumps", "superseded reasoning"],
)
return [PINNED_TASK, PINNED_MEMORY, as_note(summary), *recent]
5. Pin the facts that matter and keep external memory
Some facts must never be summarized away: the task, the acceptance criteria, the running list of findings. Pin these at a stable position and write findings to an external note (a REVIEW_NOTES.md style file) that the agent re-reads after every compaction or reset. This is the structured note-taking pattern: the model writes its memory to disk and loads it back when it needs it, so a long review survives a context reset, the way you can step away for a coffee and pick up where you left off by checking your notes.
# After each file is reviewed, append a durable line to memory
memory.append(f"- {path}: found {finding}; severity {sev}; suggest {fix}")
write("REVIEW_NOTES.md", memory)
# On the next loop, assemble_context() re-reads REVIEW_NOTES.md into tier 1
6. Check for drift
Every so often, re-state the goal in the prompt and check the agent is still working toward it. Drift is when the agent slowly forgets the original task as the transcript fills up with tool noise. A cheap probe: every few turns, ask the model to restate what it is reviewing and what it has found. If the answer starts wandering, your compaction is dropping too much, or your task is not pinned firmly enough.
Learned: deciding what to keepReviewing a 40-file PR, Xiaoman no longer stuffs the whole repo into the window. It fills the budget in tiers, pulls only the files the diff touches, and compacts old turns into a note once they cross the threshold.
How to verify
- Run a review on a large diff (20+ files). Assert in a test that
tokens(assemble_context(...))stays under your ceiling after several tool calls. - Trigger compaction deliberately on a long transcript and confirm the summary still contains the task, the confirmed bugs, and the open questions. Diff the pre- and post-compaction findings list: nothing in the durable tier should disappear.
- Run the same large PR twice: once with naive append, once with compaction plus memory. The compacted run should stay on task longer and cost fewer tokens to reach the same verdict.
- Probe for drift: ask “what are you reviewing right now?” near the end of a long run. The answer should still match the original task.
Learned: confirming nothing was lostYou can trigger a compaction on a long transcript and diff the findings list before and after, confirming the task, the confirmed bugs, and the open questions all survive.
Recap
You now treat context as a budget you manage, not a place to throw everything. You named a budget and split it into slices, tiered the content and filled top-down, fetched on demand instead of dumping everything in, compacted at a threshold before the window overflows, pinned the facts that matter to external memory, and built a probe for drift. The mechanism underneath is simple: a finite attention budget plus quadratic attention means signal density beats raw volume. That is why long agent tasks drift, and these six moves are how you fight it. The next chapter packages this reviewer into a discoverable, reusable Agent Skill.
Common pitfalls
- Stuffing everything in. A full repo in the prompt buries the signal and raises cost on every turn. Fetch slices on demand instead.
- Compacting too late. If you wait until the window overflows, the request fails outright. Compact at a threshold below the ceiling, with margin for one large tool result.
- Summarizing away the task. Aggressive compaction can erase the goal or a confirmed bug. Pin the task and key decisions outside the summarized region, and bias the summarizer toward recall first.
- Pre-loading what is rarely used. Loading the whole style guide and every neighbor file “just in case” spends the budget on tokens the model never reads. Keep optional context optional.
- No external memory. Without a notes file, every compaction is lossy and irreversible. Write findings to disk so the agent can reload them.
Xiaoman drops a slice of context it judged unimportant for the first time, and misses one line that mattered. It learns to choose, and tastes the cost of choosing. The Memory Maze lights up.
Just lit The Memory Maze · 4 / 16 lit
Sources
- Anthropic: Effective context engineering for AI agents · official
- Microsoft: AI Agents for Beginners · official