Chapter 8: Evals
Build an eval set for your agent so you can turn “it felt better” into a number you can defend.
Xiaoman · The Proving Grounds
Xiaoman is growing capable, but you cannot yet prove it is reliable. Today, the test begins.
Draft chapter. First cut to prove the format; it will be hardened before it is indexed.
What you’ll build
A real eval set for the PR reviewer: a folder of test cases, each one a saved diff plus the findings a good review must catch, and a harness that runs every case, grades it, and prints one task success rate. You’ll use two graders: an exact matcher for findings that are unambiguous (did the review flag the SQL injection on line 12?) and an LLM-as-judge with a written rubric for the open-ended parts (was the suggested fix correct, and was the tone constructive?). Finally you wire the harness into CI, so a prompt edit that quietly makes the reviewer worse fails the build instead of shipping.
Most agent tutorials skip this chapter, but it’s the one that separates a demo from a product. Without evals, every prompt change is a guess: you edit, you eyeball one PR, you convince yourself it got better, and you ship a regression you only find out about in production. OpenAI’s evals framework and Hugging Face’s agents course make the same point in different words: if you can’t measure an agent, you can’t improve it on purpose.
Prerequisites
- A working reviewer from earlier chapters, runnable headless so you can capture its output as JSON (Chapter 7 set this up with codex exec and the structured output schema).
- Ten to thirty saved cases to start. Each case is a real diff plus an expectation of what a good review catches. Quality beats quantity: ten cases taken from real failures beat a hundred made-up ones.
- An LLM you can call at a low temperature for the judge; see the official docs for message and tool formats.
Steps
1. Define the case format
A case is data, not code: an input diff and a machine-checkable expectation. You can borrow the shape from OpenAI’s evals, which store samples as JSONL and reference the data instead of baking it into the harness. Split expectations into must_find (exact, hard requirements) and rubric (open-ended, judged). That split lets cheap deterministic checks do most of the grading and keeps the expensive LLM judge for the parts only judgment can score.
// cases/001-sql-injection.json
// (comments here are for the reader; JSON has no comment syntax, so leave them out of the real file)
{
  "id": "001-sql-injection",
  "diff_path": "cases/001-sql-injection.diff",
  "must_find": [
    { "file": "api/users.py", "line": 12,
      "category": "security", "keyword": "injection" }
  ],
  "must_not_find": [
    { "category": "style" }  // reviewer should not nitpick style here
  ],
  "rubric": "The suggested fix uses a parameterized query, not string escaping."
}
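The harness in step 3 calls load_cases without showing it. Here is a minimal sketch of what such a loader might look like, assuming one plain-JSON file per case (no comments) with the fields shown above. The Case dataclass and the "id and diff_path are required" rule are illustrative assumptions, not part of any framework.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional


@dataclass
class Case:
    """One eval case, mirroring the JSON fields in the example above."""
    id: str
    diff_path: str
    must_find: list = field(default_factory=list)
    must_not_find: list = field(default_factory=list)
    rubric: Optional[str] = None


def load_cases(folder: str) -> list:
    """Load every *.json case file in `folder`, sorted by filename."""
    cases = []
    for path in sorted(Path(folder).glob("*.json")):
        raw = json.loads(path.read_text(encoding="utf-8"))
        # Fail loudly on a malformed case instead of silently skipping it.
        missing = {"id", "diff_path"} - raw.keys()
        if missing:
            raise ValueError(f"{path.name} is missing fields: {sorted(missing)}")
        cases.append(Case(
            id=raw["id"],
            diff_path=raw["diff_path"],
            must_find=raw.get("must_find", []),
            must_not_find=raw.get("must_not_find", []),
            rubric=raw.get("rubric"),
        ))
    return cases
```

Sorting by filename keeps the per-case table in a stable order from run to run, which makes two CI logs easier to compare.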
2. Collect cases from real failures, not imagination
Every time the reviewer disappoints you (misses a real bug, invents a fake one, lectures about style on a security-critical PR), save that input as a new case. A set built from production failures stays honest; a set you make up tends to favor cases the agent already passes. This is the regression set: it records every mistake you’ve already paid for, so you never pay for it twice.
WORKFLOW for adding a case:
1. reviewer produces a bad review on PR #1234
2. save the diff -> cases/0NN-short-name.diff
3. write what a good review SHOULD have said -> must_find / rubric
4. run the harness: the new case FAILS (proving it catches the bug)
5. fix the prompt/skill; the case now PASSES; commit both together
3. Run the reviewer headless with query() and capture the review as data
Before you can grade, you need something to grade. run_reviewer_headless wraps a real SDK call: it gives the reviewer the review contract as its system prompt, allows only the Read tool, runs one agent loop with query() (capped at five turns), and collects the assistant’s text output into a single string to return. This is the query() part in the middle of “your normal code plus query() plus an assertion,” and it turns one agent run into a checkable string.
import anyio
from claude_agent_sdk import (
    query, ClaudeAgentOptions, AssistantMessage, TextBlock,
)

CONTRACT = """You are a PR reviewer. To judge any file you MUST Read it first.
When done, output a review list: severity, file:line, one-line risk. Never invent code you have not read."""

async def run_reviewer_headless(target: str) -> str:
    """Run the reviewer headless on one file and return its final review text."""
    options = ClaudeAgentOptions(
        system_prompt=CONTRACT,
        allowed_tools=["Read"],
        max_turns=5,
        cwd="./repo",
    )
    review = ""
    async for message in query(prompt=f"Review {target} for bugs and risks.", options=options):
        if isinstance(message, AssistantMessage):
            for block in message.content:
                if isinstance(block, TextBlock):
                    review += block.text
    return review
import { query } from "@anthropic-ai/claude-agent-sdk";

const CONTRACT = `You are a PR reviewer. To judge any file you MUST Read it first.
When done, output a review list: severity, file:line, one-line risk. Never invent code you have not read.`;

async function runReviewerHeadless(target: string): Promise<string> {
  const q = query({
    prompt: `Review ${target} for bugs and risks.`,
    options: {
      systemPrompt: CONTRACT,
      allowedTools: ["Read"],
      maxTurns: 5,
      cwd: "./repo",
    },
  });
  let review = "";
  for await (const message of q) {
    if (message.type === "assistant") {
      for (const block of message.message.content) {
        if (block.type === "text") review += block.text;
      }
    }
  }
  return review;
}
The smallest possible eval runs that function on the known-buggy refund file and asserts the review caught the bug: the < on refund.py:13 should be <=, and as written it silently rejects a full refund. This is the assertion part of “run query() plus an assertion,” using exact matching for the hard requirement and leaving the step-4 LLM judge for the fuzzy parts.
async def test_catches_refund_bug():
    review = await run_reviewer_headless("src/payments/refund.py")
    assert "refund.py:13" in review  # hit the right line
    # "<" is a substring of "<=", so this one check covers both spellings
    assert "<" in review  # mentioned the comparison bug
    # if the exact match is too brittle, add the step-4 LLM judge as a second opinion:
    # verdict = await llm_judge(rubric="flags < that should be <=", review=review)

anyio.run(test_catches_refund_bug)
async function testCatchesRefundBug() {
  const review = await runReviewerHeadless("src/payments/refund.py");
  if (!review.includes("refund.py:13")) throw new Error("missed the line");
  // "<" is a substring of "<=", so this one check covers both spellings
  if (!review.includes("<")) throw new Error("missed the comparison bug");
  // if the exact match is too brittle, add the step-4 LLM judge as a second opinion:
  // const verdict = await llmJudge({ rubric: "flags < that should be <=", review });
}

testCatchesRefundBug().catch((e) => { console.error(e); process.exit(1); });
Wrap that run-plus-assert pair in a harness to loop over the whole set: run the reviewer headless on each diff, apply the deterministic graders first, then the LLM judge only where a rubric exists, record a per-case result, and compute the task success rate. Keep the per-case output: a single summary number won’t tell you which kind of task regressed.
def run_evals(threshold=0.85):
    results = []
    for case in load_cases("cases/"):
        # case.target: path of the file to review (a field the step-1 example would also need)
        review = anyio.run(run_reviewer_headless, case.target)  # JSON list of findings
        # cheap, deterministic checks first
        hard_ok = (all(matches(review, m) for m in case.must_find)
                   and all(absent(review, n) for n in case.must_not_find))
        # expensive judge only if there is a rubric and the hard part passed
        soft_ok = True
        if case.rubric and hard_ok:
            soft_ok = llm_judge(case, review)  # see step 4
        results.append({"id": case.id,
                        "pass": hard_ok and soft_ok,
                        "hard": hard_ok, "soft": soft_ok})
    rate = sum(r["pass"] for r in results) / len(results)
    print_table(results)  # per-case, not just the rate
    return rate, rate >= threshold
async function runEvals(threshold = 0.85) {
  const results = [];
  for (const c of loadCases("cases/")) {
    const review = await runReviewerHeadless(c.target); // JSON list of findings
    // cheap, deterministic checks first
    const hardOk =
      c.mustFind.every((m) => matches(review, m)) &&
      c.mustNotFind.every((n) => absent(review, n));
    // expensive judge only if there is a rubric and the hard part passed
    let softOk = true;
    if (c.rubric && hardOk) softOk = await llmJudge(c, review); // see step 4
    results.push({ id: c.id, pass: hardOk && softOk, hard: hardOk, soft: softOk });
  }
  const rate = results.filter((r) => r.pass).length / results.length;
  printTable(results); // per-case, not just the rate
  return { rate, ok: rate >= threshold };
}
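The harness leaves matches and absent undefined. One way to write them is sketched below, assuming the reviewer’s output has already been parsed into a list of finding dicts with file, line, category, and message keys. That shape is an assumption; adapt the key names to whatever your Chapter 7 schema actually emits. The keyword rule (case-insensitive substring of the message) is also just one reasonable choice.

```python
def _hits(finding: dict, expected: dict) -> bool:
    """True if one finding satisfies every field the expectation names."""
    for key, want in expected.items():
        if key == "keyword":
            # keyword: case-insensitive substring of the finding's message
            if want.lower() not in finding.get("message", "").lower():
                return False
        elif finding.get(key) != want:
            # every other field (file, line, category) must match exactly
            return False
    return True


def matches(findings: list, expected: dict) -> bool:
    """must_find: at least one finding satisfies the expectation."""
    return any(_hits(f, expected) for f in findings)


def absent(findings: list, forbidden: dict) -> bool:
    """must_not_find: no finding satisfies the forbidden pattern."""
    return not matches(findings, forbidden)


# Example: one security finding on line 12, no style findings.
findings = [
    {"file": "api/users.py", "line": 12, "category": "security",
     "message": "SQL injection: query built by string concatenation"},
]
must_find = {"file": "api/users.py", "line": 12,
             "category": "security", "keyword": "injection"}
print(matches(findings, must_find), absent(findings, {"category": "style"}))
```

Because an expectation only constrains the fields it names, { "category": "style" } forbids any style finding anywhere in the review, while the must_find entry pins file, line, category, and keyword together.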
4. Write the LLM-judge prompt with a tight rubric
The judge gets the case, the reviewer’s output, and a short rubric, and returns a strict pass/fail plus one reason. Pin a low temperature so verdicts stay stable across runs, and force structured output so you can parse it. An LLM judge fails in two ways: the rubric is too vague (one like “is it good?” just gives you noise), or the judge is too lenient (judges tend to pass borderline answers). So the rubric has to be specific and the output has to be binary, not a 1-to-10 score you’ll end up arguing about.
SYSTEM: You are a strict grader. Output JSON: {"verdict":"pass"|"fail","reason":"..."}.
Default to "fail" unless the rubric is clearly satisfied. One sentence reason.
USER:
TASK the reviewer was given:
{task: review this diff for security/perf/tests, output findings}
RUBRIC (the ONLY thing you grade):
"{case.rubric}" e.g. "The suggested fix uses a parameterized query."
REVIEWER OUTPUT:
{review_json}
Did the reviewer output satisfy the rubric? Judge ONLY the rubric,
not overall quality. Ignore tone unless the rubric mentions tone.
The judge is itself a model, so treat its verdicts as data to audit, not as truth. Sample its calls now and then and check them by hand (see the step in How to verify). When you disagree with it, fix the rubric, not the agent.
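Forcing structured output only helps if the harness refuses anything that doesn’t fit. A minimal sketch of the parsing side, assuming the judge’s reply arrives as a raw string in the {"verdict", "reason"} shape from the prompt above; parse_verdict is a hypothetical helper name. It applies the same default-to-fail rule the prompt asks of the judge.

```python
import json


def parse_verdict(raw: str) -> tuple:
    """Turn the judge's raw reply into (passed, reason); anything malformed fails."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False, "judge reply was not valid JSON"
    if not isinstance(data, dict):
        return False, "judge reply was not a JSON object"
    verdict = data.get("verdict")
    if verdict not in ("pass", "fail"):
        # "maybe", "7/10", or a missing key all count as a fail
        return False, f"unexpected verdict: {verdict!r}"
    return verdict == "pass", str(data.get("reason", ""))
```

Keeping the reason alongside the boolean is what makes the hand audit practical: you can skim the reasons for the verdicts you sampled instead of re-reading every review.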
5. Run the whole set and wire it into CI
Run every case on each pull request to the reviewer’s own repo, compute the rate, and fail the build if it drops below your threshold. This turns “I think the prompt is better” into a gate. On failure, print the per-case table so the diff that broke a case is obvious.
# .github/workflows/evals.yml (illustrative; verify syntax in GitHub Actions docs)
name: reviewer-evals
on: [pull_request]
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      - run: python run_evals.py --threshold 0.85
        env:
          MODEL_API_KEY: ${{ secrets.MODEL_API_KEY }}
# run_evals.py exits non-zero when rate < threshold -> build fails
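The workflow depends on run_evals.py accepting --threshold and exiting non-zero on a regression. A minimal sketch of that entry point follows, assuming the harness is passed in as a callable that returns the success rate; main, exit_code, and the run parameter are hypothetical names chosen for illustration.

```python
import argparse


def parse_args(argv: list) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Run the reviewer eval set.")
    parser.add_argument("--threshold", type=float, default=0.85,
                        help="minimum task success rate required to pass")
    return parser.parse_args(argv)


def exit_code(rate: float, threshold: float) -> int:
    """0 when the rate meets the threshold, 1 otherwise (CI fails on non-zero)."""
    return 0 if rate >= threshold else 1


def main(argv: list, run) -> int:
    """`run` stands in for the step-3 harness; it must return the success rate."""
    args = parse_args(argv)
    rate = run()
    print(f"task success rate: {rate:.2%} (threshold {args.threshold:.2%})")
    return exit_code(rate, args.threshold)


# In run_evals.py you would end with something like:
#   sys.exit(main(sys.argv[1:], run=lambda: run_evals()[0]))
```

Returning the exit code from main, rather than calling sys.exit inside it, keeps the gate itself testable without spawning a process.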
Learned: sitting its first exam
Xiaoman gets graded case by case against a set built from real failures, with exact matching for the hard requirements and an LLM judge for the fuzzy parts, so it can report a task success rate instead of leaning on a vague “I think it got better.”
How to verify
- Break the prompt on purpose (delete the “check for injection” line) and rerun. The success rate has to fall, and the case that fails should be the SQL-injection one. If the rate holds, your cases are too easy or your graders too loose.
- Hand-grade ten judge verdicts. If you disagree often, the bug is in the rubric: tighten the wording, don’t touch the agent. A judge you don’t trust is worse than no judge.
- Confirm CI actually blocks a merge when the score regresses. A green checkmark that never goes red is decoration, not a gate.
- Check that must_not_find works: a reviewer that floods every PR with style nits should fail the cases that forbid style noise, even if it catches the real bug.
Learned: owning up to a mistake
You can now delete the check-for-injection line on purpose, rerun, and watch the rate fall while the SQL-injection case goes red, confirming the eval catches exactly what Xiaoman missed rather than easy cases it already passes.
Why it works
Splitting the deterministic graders from the LLM judge is the whole point. Most of what a good review must do you can check without a model: a finding either references line 12 with the keyword “injection” or it doesn’t. Spending an LLM call to grade that is slower, costlier, and noisier than a string match. You keep the judge for the genuinely fuzzy part (is the fix correct, is the tone constructive) where no rule can decide. OpenAI’s evals are designed the same way: they offer exact-match templates and model-graded templates as separate tools for exactly this reason. The regression set works because it’s adversarial by design: every case got added the moment the agent failed it, so the set is concentrated on the agent’s real weak spots rather than on cases it already aces.
Going further
Track the rate over time, not just pass/fail on one run, so you can see slow drift as you edit prompts. Add a tiny “judge calibration” set: a handful of cases with hand-assigned verdicts that you run the judge against, so you can catch the judge itself getting worse. And weight the cases: a missed SQL injection should cost more than a missed style nit, and an unweighted success rate hides that.
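To make the weighting idea concrete, here is a small sketch assuming each per-case result carries an optional weight field. That field is hypothetical (the case format in step 1 doesn’t define it), and the weights in the example are made up to show the effect.

```python
def weighted_rate(results: list, default_weight: float = 1.0) -> float:
    """Share of total weight carried by passing cases."""
    total = sum(r.get("weight", default_weight) for r in results)
    if total == 0:
        raise ValueError("no weighted cases to score")
    passed = sum(r.get("weight", default_weight) for r in results if r["pass"])
    return passed / total


# Two cases: the reviewer missed the injection but caught the style nit.
results = [
    {"id": "001-sql-injection", "pass": False, "weight": 5.0},
    {"id": "002-style-nit", "pass": True, "weight": 1.0},
]
unweighted = sum(r["pass"] for r in results) / len(results)  # 0.5
weighted = weighted_rate(results)                            # 1/6, about 0.17
```

The unweighted rate calls this run a coin flip; the weighted one shows that most of what mattered was missed.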
Recap
You replaced guesswork with a number you can defend, built from real failures and graded by the cheapest tool that can do the job: exact matching where the answer is unambiguous, an LLM judge with a tight binary rubric where it isn’t. The set is now your regression net, run in CI, so refactors and prompt edits get scored instead of trusted. The next chapter makes individual runs observable, so when a case fails you can see why, not just that it did.
Common pitfalls
- One number hides everything. A single success rate can’t tell you which task type regressed. Keep and print per-case results.
- The judge is also a model. It can be wrong, lenient, or gamed. Audit its verdicts, keep rubrics narrow and binary, and fix the rubric (not the agent) when you disagree.
- Static sets go stale. A set you stop adding to drifts toward cases the agent already passes. Keep feeding it new failures from production.
- Cases that are too easy. If breaking the prompt doesn’t move the number, the eval proves nothing. Calibrate by deliberately making it worse and watching the rate drop.
- No must_not_find. Grading only what should appear lets a noisy reviewer slip through. Penalize false positives too.
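For the first pitfall, one cheap fix is to report a rate per task type next to the overall number. A sketch, assuming each per-case result has been given a tag such as "security" or "style"; the tag field is hypothetical and not part of the case format shown in step 1.

```python
from collections import defaultdict


def rate_by_tag(results: list) -> dict:
    """Success rate per tag, so a regression in one task type stays visible."""
    buckets = defaultdict(list)
    for r in results:
        # results without a tag are grouped together rather than dropped
        buckets[r.get("tag", "untagged")].append(bool(r["pass"]))
    return {tag: sum(passes) / len(passes) for tag, passes in buckets.items()}


tagged = [
    {"id": "001", "tag": "security", "pass": True},
    {"id": "002", "tag": "security", "pass": False},
    {"id": "003", "tag": "style", "pass": True},
]
by_tag = rate_by_tag(tagged)  # security: 0.5, style: 1.0
```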
You test Xiaoman for the first time, and it fails one item. Then it goes quiet for two seconds and quietly posts where it went wrong. It has learned to doubt itself, the first step from obedient toward trustworthy. The Proving Grounds light up.
Sources
- OpenAI Evals · official
- Agent observability and evaluation (Hugging Face Agents Course) · official