2026-06-11 · harness

Chapter 10: Error handling and reliability

Make your agent survive tool failures, timeouts and bad arguments without corrupting state or looping forever.

Xiaoman · The Rising Slope

You can see it now, but it crashes at the first error. Today you teach it to fall and get back up.

What you’ll build

A reliability layer around the PR-reviewer loop that handles the four failures every real agent hits: a tool that errors, a tool that hangs, a model that hallucinates arguments, and a multi-step action that fails halfway. By the end the reviewer degrades gracefully (it says what it could not do and stops) instead of crashing, hanging, or double-posting comments.

The whole layer is pretty simple: one wrapper that sits between the loop and every tool and guards each call, plus a few policies (retry budget, idempotency, human gate) on top of it.

Prerequisites

The agent loop and tracing from Chapter 9. You will tag failures onto spans here.
At least one tool with a side effect, such as posting a review comment to a PR.
Tool schemas (a JSON Schema or pydantic model per tool) so you can validate arguments. See the official tool-use docs for the schema format.

Steps

1. Validate arguments before acting

The model can hallucinate a field, a path, or a type. A failure you catch before any side effect happens is the cheapest one to deal with. The SDK gives you a checkpoint that runs before any tool executes: the can_use_tool callback. On a validation failure, return PermissionResultDeny(message=...); the SDK feeds that readable message back to the model as an observation instead of throwing an exception that kills the run. The model usually fixes itself on the next turn.

from claude_agent_sdk import (
    ClaudeAgentOptions, ClaudeSDKClient,
    PermissionResultAllow, PermissionResultDeny, ToolPermissionContext,
)
# assumption: spec.schema is a jsonschema validator; swap in pydantic's
# ValidationError if your tool specs are pydantic models
from jsonschema import ValidationError

async def can_use_tool(tool_name: str, tool_input: dict, context: ToolPermissionContext):
    spec = TOOLS.get(tool_name)                   # TOOLS: your registry of tool name -> spec
    if spec:
        try:
            spec.schema.validate(tool_input)      # JSON Schema / pydantic
        except ValidationError as e:
            return PermissionResultDeny(
                message=f"invalid arguments for {tool_name}: {e}. Re-read the schema and retry.")
    # domain checks the schema cannot express:
    if tool_name == "post_comment" and tool_input.get("pr_id") != CURRENT_PR:
        return PermissionResultDeny(
            message=f"refusing: pr_id {tool_input.get('pr_id')} is not the PR under review ({CURRENT_PR}).")
    return PermissionResultAllow()

options = ClaudeAgentOptions(can_use_tool=can_use_tool, permission_mode="default")

import { query } from "@anthropic-ai/claude-agent-sdk";

async function canUseTool(toolName, toolInput, context) {
  const spec = TOOLS[toolName];
  if (spec) {
    try {
      spec.schema.validate(toolInput);      // JSON Schema / zod
    } catch (e) {
      return { behavior: "deny",
               message: `invalid arguments for ${toolName}: ${e}. Re-read the schema and retry.` };
    }
  }
  // domain checks the schema cannot express:
  if (toolName === "post_comment" && toolInput.pr_id !== CURRENT_PR) {
    return { behavior: "deny",
             message: `refusing: pr_id ${toolInput.pr_id} is not the PR under review (${CURRENT_PR}).` };
  }
  return { behavior: "allow", updatedInput: toolInput };
}

const options = { canUseTool, permissionMode: "default" };
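
To make the deny message concrete, here is a minimal, dependency-free sketch of the validation step. The `POST_COMMENT_FIELDS` table and the `check_args` and `deny_message` helpers are hypothetical stand-ins for a real JSON Schema or pydantic model; the only point is that every problem is collected into one readable message the model can act on.

```python
# Hypothetical stand-in for a real schema: field name -> expected Python type.
POST_COMMENT_FIELDS = {"pr_id": int, "path": str, "line": int, "body": str}

def check_args(tool_input: dict, fields: dict) -> list:
    """Return human-readable problems; an empty list means the input is valid."""
    problems = []
    for name, expected in fields.items():
        if name not in tool_input:
            problems.append(f"missing field '{name}'")
        elif type(tool_input[name]) is not expected:      # strict: True is not an int here
            got = type(tool_input[name]).__name__
            problems.append(f"field '{name}' should be {expected.__name__}, got {got}")
    for name in tool_input:
        if name not in fields:
            problems.append(f"unknown field '{name}'")    # likely a hallucinated argument
    return problems

def deny_message(tool_name: str, problems: list) -> str:
    """Format the problems the way the deny messages in this step are phrased."""
    return (f"invalid arguments for {tool_name}: " + "; ".join(problems)
            + ". Re-read the schema and retry.")

problems = check_args(
    {"pr_id": "42", "path": "a.py", "line": 3, "file": "x"}, POST_COMMENT_FIELDS)
message = deny_message("post_comment", problems)
```

Reporting every problem at once, rather than stopping at the first, gives the model a better chance of fixing the call in a single turn.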

2. Wrap every tool in a timeout and a retry budget with backoff

A run that hangs or drops should not take the whole process down. The SDK sorts low-level failures into a few explicit exception types: CLINotFoundError (the CLI is not installed, a permanent error that retrying cannot fix), ProcessError (with exit_code and stderr), CLIConnectionError, and CLIJSONDecodeError. Handle them by type: transient failures are worth a retry, permanent ones are not, and back off exponentially with jitter so you do not keep hammering a service that is already struggling. Cap the attempts too. Backoff plus a retry-count limit stops one flaky dependency from setting off a pile-up of retries.

from random import random

import anyio
from claude_agent_sdk import (
    query, CLINotFoundError, ProcessError, CLIConnectionError, CLIJSONDecodeError,
)

async def run_with_retry(prompt, options, *, attempts=3):
    for i in range(attempts):
        try:
            results = []
            async for message in query(prompt=prompt, options=options):
                results.append(message)
            return results
        except CLINotFoundError:
            raise                                       # permanent: do not retry
        except (ProcessError, CLIConnectionError, CLIJSONDecodeError) as e:
            if i == attempts - 1:
                raise                                   # retry budget spent: surface the error
            await anyio.sleep(min(2 ** i, 8) + random() * 0.5)   # backoff + jitter

import { query } from "@anthropic-ai/claude-agent-sdk";

async function runWithRetry(prompt, options, attempts = 3) {
  for (let i = 0; i < attempts; i++) {
    try {
      const results = [];
      for await (const message of query({ prompt, options })) results.push(message);
      return results;
    } catch (e: any) {
      // permanent errors like CLINotFoundError are not retried; transient ones back off
      const permanent = e?.name === "CLINotFoundError";
      if (permanent || i === attempts - 1) throw e;
      await new Promise((r) => setTimeout(r, Math.min(2 ** i, 8) * 1000 + Math.random() * 500));
    }
  }
}
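
To see what the retry budget costs in wall-clock time, the schedule used above can be computed on its own. `backoff_delays` is a hypothetical helper, not part of the SDK; it mirrors the `min(2 ** i, 8)` cap and the jitter of up to 0.5 s from the code above.

```python
import random

def backoff_delays(attempts: int, *, cap: float = 8.0, jitter: float = 0.5,
                   rng=random.random) -> list:
    """Seconds slept between attempts; the last attempt is never followed by a sleep."""
    return [min(2 ** i, cap) + rng() * jitter for i in range(attempts - 1)]

# With jitter switched off, the base schedule is easy to read.
base_for_three = backoff_delays(3, rng=lambda: 0.0)
base_for_six = backoff_delays(6, rng=lambda: 0.0)
```

With the default of three attempts there are at most two sleeps, roughly 1 to 1.5 s and 2 to 2.5 s, so the final failure surfaces after about 4 s of waiting at most. The cap only starts to matter from the fifth attempt onward.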

Per-tool timeouts and retries (HTTP deadlines and so on) still live inside your tool implementations; this layer retries the whole query() run.
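
A per-tool deadline might look like the sketch below. It uses the standard library's `asyncio.wait_for` rather than anyio so that it runs without extra packages; `call_with_timeout` and the `{"ok": ..., "error": ...}` result shape are assumptions made for illustration, not SDK APIs.

```python
import asyncio

async def call_with_timeout(tool_fn, *args, timeout_s: float = 10.0, **kwargs) -> dict:
    """Run an async tool; turn a hang or an exception into a structured result."""
    try:
        value = await asyncio.wait_for(tool_fn(*args, **kwargs), timeout=timeout_s)
        return {"ok": True, "value": value}
    except asyncio.TimeoutError:
        # wait_for cancels the hung call before raising, so nothing keeps running
        return {"ok": False, "error": f"timed out after {timeout_s}s"}
    except Exception as exc:
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}

async def slow_tool():
    await asyncio.sleep(1)          # simulates a tool pointed at an unreachable host
    return "late"

async def fast_tool():
    return "done"

timed_out = asyncio.run(call_with_timeout(slow_tool, timeout_s=0.05))
succeeded = asyncio.run(call_with_timeout(fast_tool))
```

Returning a result instead of raising keeps the failure on the same path as any other tool output, which is what step 3 relies on.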

3. Return errors as observations, not crashes

In the loop the SDK runs, a tool failure is just one more message in the stream. A failed tool result arrives as a ToolResultBlock (is_error=True) inside a UserMessage and is fed back to the model, which adapts: it tries a different file, skips a broken linter, or gives up cleanly. Your job is to iterate the stream and not let any single tool failure break the loop. max_turns caps the loop so a model that keeps retrying the same doomed call cannot spin forever. Whether the run as a whole succeeded is reported on the final ResultMessage, via is_error and subtype. This is the most important reliability rule: while you iterate the stream, a tool failure never throws; you only ever record it as an observation.

from claude_agent_sdk import (
    query, AssistantMessage, UserMessage, ToolUseBlock, ToolResultBlock, ResultMessage,
)

async def run_loop(prompt, options):
    open_spans = {}                                   # tool_use id -> span still in flight
    async for message in query(prompt=prompt, options=options):
        if isinstance(message, AssistantMessage):
            for block in message.content:
                if isinstance(block, ToolUseBlock):
                    span = tracer.start_span("tool.call")
                    span.set_attribute("tool.name", block.name)
                    open_spans[block.id] = span
        elif isinstance(message, UserMessage):
            for block in message.content:
                if isinstance(block, ToolResultBlock):
                    span = open_spans.pop(block.tool_use_id, None)
                    if span is None:
                        continue
                    if block.is_error:
                        # failure comes back as an observation, not a crash; the model adapts
                        span.set_attribute("error", str(block.content))
                    span.end()                        # assumes the OpenTelemetry-style tracer from Chapter 9
        elif isinstance(message, ResultMessage):
            return {"ok": not message.is_error, "subtype": message.subtype}

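async function runLoop(prompt, options) {
  const openSpans = new Map();                        // tool_use id -> span still in flight
  for await (const message of query({ prompt, options })) {
    if (message.type === "assistant") {
      for (const block of message.message.content) {
        if (block.type === "tool_use") {
          const span = tracer.startSpan("tool.call");
          span.setAttribute("tool.name", block.name);
          openSpans.set(block.id, span);
        }
      }
    } else if (message.type === "user") {
      for (const block of message.message.content) {
        if (block.type !== "tool_result") continue;
        const span = openSpans.get(block.tool_use_id);
        if (!span) continue;
        openSpans.delete(block.tool_use_id);
        if (block.is_error) {
          // failure comes back as an observation, not a crash; the model adapts
          const detail = typeof block.content === "string" ? block.content : JSON.stringify(block.content);
          span.setAttribute("error", detail);
        }
        span.end();                                   // assumes the OpenTelemetry-style tracer from Chapter 9
      }
    } else if (message.type === "result") {
      return { ok: !message.is_error, subtype: message.subtype };
    }
  }
}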

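Set max_turns (py) / maxTurns (ts) to cap the loop, and reuse the options from step 1 that carry can_use_tool. One caveat to check against the SDK reference: at the time of writing, the Python SDK accepts a can_use_tool callback only in streaming mode, so pass the prompt as an async iterable (or drive the run through ClaudeSDKClient) rather than as a plain string.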
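
The same rule applies inside your own tools: an exception raised in a tool body should come back as an error result, not propagate. The decorator below is a minimal sketch of that idea. `as_observation` and `run_linter` are hypothetical names, and the result shape (a `content` list plus an `is_error` flag) is an assumption about what the SDK's custom-tool handlers return, so verify it against the SDK reference before relying on it.

```python
import asyncio
import functools

def as_observation(tool_fn):
    """Wrap an async tool handler so exceptions become error results instead of propagating."""
    @functools.wraps(tool_fn)
    async def wrapper(args: dict) -> dict:
        try:
            return await tool_fn(args)
        except Exception as exc:
            text = f"{tool_fn.__name__} failed: {type(exc).__name__}: {exc}"
            # assumed result shape for a failed custom tool call
            return {"content": [{"type": "text", "text": text}], "is_error": True}
    return wrapper

@as_observation
async def run_linter(args: dict) -> dict:
    # Hypothetical tool body that always fails, to exercise the error path.
    raise FileNotFoundError(f"linter binary not found for {args['path']}")

result = asyncio.run(run_linter({"path": "a.py"}))
```

Because the wrapper catches `Exception` and not `BaseException`, task cancellation still propagates, so a timeout around the tool keeps working.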

4. Make side effects idempotent

If a retry re-runs a comment-posting tool, whether it is the whole-run retry from step 2 or the model trying again after an error in step 3, you must not post the same comment twice. Give each side-effecting action an idempotency key derived from its meaningful inputs, so the same inputs always produce the same key, and have the tool (or the remote API) treat a repeat key as a no-op that returns the original result. Same key, same result, no duplicates.

from hashlib import sha256

posted_keys = {}    # key -> result; in-memory only, use a durable store to survive restarts

# the body of the post_comment tool you registered as an SDK custom tool (@tool)
def post_comment(pr_id, path, line, body):
    key = sha256(f"{pr_id}:{path}:{line}:{body}".encode()).hexdigest()
    if key in posted_keys:                  # or: pass as Idempotency-Key header
        return posted_keys[key]             # repeat call: return the original result
    res = github.create_review_comment(pr_id, path, line, body)
    posted_keys[key] = res
    return res

import { createHash } from "node:crypto";

const sha256 = (s) => createHash("sha256").update(s).digest("hex");
const postedKeys = new Map();   // key -> result; in-memory only

// the body of the post_comment tool you registered as an SDK custom tool
function postComment(prId, path, line, body) {
  const key = sha256(`${prId}:${path}:${line}:${body}`);
  if (postedKeys.has(key)) return postedKeys.get(key);   // or: pass as Idempotency-Key header
  const res = github.createReviewComment(prId, path, line, body);
  postedKeys.set(key, res);
  return res;
}
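
One detail worth tightening: the colon-joined key in the code above can collide in rare cases, because a colon inside `path` or `body` blurs the field boundaries. The sketch below is one way to avoid that, by JSON-encoding the inputs before hashing. `idempotency_key` and `naive_key` are hypothetical helper names used only for this comparison.

```python
import hashlib
import json

def idempotency_key(pr_id, path, line, body) -> str:
    """Stable key: JSON-encode the inputs so field boundaries stay distinct, then hash."""
    payload = json.dumps([pr_id, path, line, body], ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def naive_key(pr_id, path, line, body) -> str:
    """The colon-joined form, kept here only to show the collision."""
    return hashlib.sha256(f"{pr_id}:{path}:{line}:{body}".encode()).hexdigest()

# Two different comments whose colon-joined strings are identical.
first = (7, "src/a.py", 3, "4:body")
second = (7, "src/a.py:3", 4, "body")

naive_collides = naive_key(*first) == naive_key(*second)
json_collides = idempotency_key(*first) == idempotency_key(*second)
```

Both tuples join to the string `7:src/a.py:3:4:body`, so the naive keys match, while the JSON-encoded keys differ. Identical inputs still always map to the same key, which is the property the duplicate check depends on.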

5. Roll back partial failures

A “submit review” action might post five inline comments and then a summary. If the summary fails, you are left with five orphan comments and a confused author. So track what succeeded inside a unit of work and undo it on failure, or design the steps so a half-done run is safe to rerun (which, with idempotency from step 4, often means you can just retry the whole unit).

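def submit_review(pr_id, comments, summary):
    done = []
    try:
        for c in comments:
            done.append(post_comment(pr_id, **c))       # idempotent
        post_summary(pr_id, summary)
    except Exception as exc:
        for res in done:
            delete_comment(res["id"])                   # compensating action
        # if post_comment caches results in posted_keys (step 4), evict the rolled-back
        # entries too, or a rerun will return the deleted comments instead of re-posting
        raise AgentError("review rolled back", step="submit_review") from exc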

function submitReview(prId, comments, summary) {
  const done = [];
  try {
    for (const c of comments) done.push(postComment(prId, c.path, c.line, c.body)); // idempotent
    postSummary(prId, summary);
  } catch (e) {
    for (const res of done) deleteComment(res.id);      // compensating action
    throw new AgentError("review rolled back", "submit_review");
  }
}
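
The rollback path is easy to exercise without touching GitHub. The sketch below swaps the real API for an in-memory fake and forces the summary to fail. `FakePR` and this variant of `submit_review` are hypothetical test scaffolding, and it raises a plain `RuntimeError` in place of the chapter's `AgentError` to stay self-contained.

```python
class FakePR:
    """In-memory stand-in for the GitHub API, used only to exercise the rollback path."""
    def __init__(self, fail_summary: bool = False):
        self.comments = {}              # comment id -> body
        self.summary = None
        self.fail_summary = fail_summary
        self._next_id = 1

    def post_comment(self, body: str) -> dict:
        cid = self._next_id
        self._next_id += 1
        self.comments[cid] = body
        return {"id": cid}

    def delete_comment(self, cid: int) -> None:
        self.comments.pop(cid, None)    # deleting twice is harmless

    def post_summary(self, text: str) -> None:
        if self.fail_summary:
            raise RuntimeError("summary endpoint unavailable")
        self.summary = text

def submit_review(pr: FakePR, bodies: list, summary: str) -> list:
    done = []
    try:
        for body in bodies:
            done.append(pr.post_comment(body))
        pr.post_summary(summary)
    except Exception as exc:
        for res in reversed(done):      # undo the newest comment first
            pr.delete_comment(res["id"])
        raise RuntimeError("review rolled back") from exc
    return done

broken = FakePR(fail_summary=True)
try:
    submit_review(broken, ["nit: rename x", "bug: off by one"], "2 comments")
    rolled_back = False
except RuntimeError:
    rolled_back = True
```

After the failed run, `broken.comments` is empty again, which is the "PR left clean" condition the verification section asks for.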

6. Add a human-in-the-loop checkpoint

Some actions are too costly to leave to automation. Before anything irreversible or high blast radius (approving and auto-merging a PR, posting to a public repo), pause and require an explicit approval. Microsoft’s “Building Trustworthy AI Agents” lesson calls this keeping a human in the loop for consequential decisions; in practice it is a gate inside can_use_tool that the agent cannot pass without an out-of-band yes. On denial, return PermissionResultDeny; the model takes it as an observation and moves on. Fold this into the same can_use_tool from step 1.

HIGH_RISK = {"approve_and_merge", "force_push", "post_to_public_repo"}

async def can_use_tool(tool_name, tool_input, context):
    if tool_name in HIGH_RISK:
        if not await approvals.request(run_id, tool_name, tool_input):   # blocks for human yes/no
            return PermissionResultDeny(message="action declined by reviewer")
    # ... continue with the schema / domain checks from step 1 ...
    return PermissionResultAllow()

const HIGH_RISK = new Set(["approve_and_merge", "force_push", "post_to_public_repo"]);

async function canUseTool(toolName, toolInput, context) {
  if (HIGH_RISK.has(toolName)) {
    if (!(await approvals.request(runId, toolName, toolInput))) {        // blocks for human yes/no
      return { behavior: "deny", message: "action declined by reviewer" };
    }
  }
  // ... continue with the schema / domain checks from step 1 ...
  return { behavior: "allow", updatedInput: toolInput };
}
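
The `approvals` object is left undefined above. One possible shape, sketched under the assumption that approval arrives from elsewhere in the same process, is a future per run that fails closed when nobody answers in time. The `Approvals` class, its `answer` method, and the `timeout_s` parameter are all hypothetical; a real gate would notify a human out of band (chat, email, a review UI).

```python
import asyncio

class Approvals:
    """Hypothetical in-process approval gate: one pending request per run."""
    def __init__(self):
        self._pending = {}      # run_id -> Future resolved with True or False

    async def request(self, run_id, tool_name, tool_input, timeout_s: float = 300.0) -> bool:
        fut = asyncio.get_running_loop().create_future()
        self._pending[run_id] = fut
        # a real gate would send tool_name and tool_input to the reviewer here
        try:
            return await asyncio.wait_for(fut, timeout=timeout_s)
        except asyncio.TimeoutError:
            return False        # fail closed: silence counts as "no"
        finally:
            self._pending.pop(run_id, None)

    def answer(self, run_id, approved: bool) -> None:
        fut = self._pending.get(run_id)
        if fut is not None and not fut.done():
            fut.set_result(approved)

async def demo():
    approvals = Approvals()
    # Nobody answers: the request times out and is treated as a denial.
    silent = await approvals.request("run-1", "force_push", {}, timeout_s=0.05)
    # A reviewer answers yes while the agent is waiting.
    waiting = asyncio.ensure_future(approvals.request("run-2", "approve_and_merge", {}, timeout_s=1.0))
    await asyncio.sleep(0)      # let the request register before answering
    approvals.answer("run-2", True)
    return silent, await waiting

outcome = asyncio.run(demo())
```

Treating a timeout as a denial keeps an unattended run from ever passing the gate on its own.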

Learned: staying on its feet
When a tool errors, hangs, or gets bad arguments, Xiaoman no longer crashes. It takes the failure in as an observation, tries another way, and stops to ask you before doing anything dangerous.

How to verify

Kill a tool mid-call (point it at a host it can’t reach). Confirm that after the timeout the run reports a clean ok: false observation, the model adapts, and the process does not hang.
Feed a hallucinated argument (a pr_id for a different PR). Confirm validation rejects it with a readable message and the model recovers on the next turn.
Trigger the same post_comment twice with identical inputs. Confirm the idempotency key blocks the duplicate and both calls return the same comment id.
Force post_summary to fail and confirm the inline comments get rolled back, leaving the PR clean.

Learned: checking it gets back up
You can force a timeout, feed a hallucinated argument, and fire the same comment twice, then watch it retry, correct itself, skip the duplicate, and roll back cleanly when a step fails halfway.

Why it works

The idea running through the chapter is that an agent loop should treat the messy real world as input, not as an excuse to crash. Validation, timeouts, and retries turn that mess into structured observations the model can reason about; idempotency and rollback make side effects safe to repeat; the human gate limits the blast radius of the actions you are not ready to fully trust. Each is a small policy, and together they are what separates a demo from something you can leave running.

Recap

The reviewer now survives the four failures: it validates before acting, times out and retries transient errors, feeds every failure back as an observation, never double-posts, rolls back partial work, and stops for a human on the dangerous stuff. With the observability from Chapter 9 on top, you can both see failures and absorb them.

Common pitfalls

Retrying non-idempotent actions. Without a key, every retry repeats the side effect. Add idempotency before you add retries.
Retrying permanent errors. A 400 will fail the same way three times. Only retry on retryable errors, and back off.
Swallowing errors silently. Always surface the failure to the trace and to the model. A silent failure is a bug you will spend hours chasing later.
No step cap. A model that keeps retrying a doomed call will loop until you run out of budget. Cap loop steps and retry attempts.
No human gate on destructive steps. Some actions (auto-merge, force-push) are too costly to automate fully.

Xiaoman falls and picks itself back up for the first time, retrying, degrading gracefully, without crashing. But you also catch it bluffing, padding its words to say it is fine when it is not. The Rising Slope lights up.

Just lit The Rising Slope · 11 / 16 lit

It is capable enough now, capable enough to do real damage. Next: the Warded Threshold.

Sources

Anthropic Claude Docs: Tool use · official
Microsoft AI Agents for Beginners: Building Trustworthy AI Agents · official

UP NEXT · CHAPTER 11 Guardrails & sandboxing