2026-06-11 · harness

Chapter 15: Capstone, ship and publish

Take the PR reviewer from demo to trustworthy, package it as a skill, and run a closing checklist across the whole book.

Xiaoman · Full Bloom

All this way, Xiaoman has grown from an echo into what it is today. In this last chapter, let it come full.

What you’ll build

The final shape of the PR reviewer: a packaged, published skill that a teammate can install in a clean environment and trust on their own repos. This is the last chapter, so it does two jobs. First, it puts together the full system, every piece you built across the previous fourteen chapters, into one diagram and one folder. Second, it gets you across the gap from a demo to something you can actually trust: the small set of properties that turn “it impressed me once” into “I’d let it comment on my pull requests with nobody watching.”

By the end you have a pr-reviewer/ skill folder with a SKILL.md, a written contract, and a runbook, plus a closing checklist where every earlier chapter maps to one line you can only tick once you have evidence.

Prerequisites

A deployed, versioned agent from Chapters 13 and 14, with rollback one flag away.
Your eval suite green on the current release.
The official Agent Skills docs open, so you can match the package to the structure they expect (SKILL.md with name and description frontmatter).

Steps

1. See the whole system at once

Before packaging, draw what you actually built. Each chapter built one part; this chapter is where you bolt them together. Sketching it makes any missing part obvious, and the sketch doubles as the architecture section of your docs.

                    GitHub PR webhook
                           |
                  [ deployment  ch13 ]      <- runs the loop, scopes the token
                           |
                  [ release router  ch14 ]  <- canary %, version stamp, rollback
                           |
            +------------- agent loop  ch3 -------------+
            |  system prompt = contract       ch2       |
            |  context: diff + repo facts     ch4       |
            |  SKILL.md: review procedure     ch5       |
            |  hooks / slash entry points     ch6       |
            |  subagent: deep security pass   ch7       |
            |  guardrails: sandbox + scope    ch11      |
            +-------------------+-----------------------+
                                |
                   posts review comments (only)
                                |
        [ evals ch9 ] [ observability ch10 ] [ cost ch12 ] watch from the side

The two parts that actually earn trust are not on the happy path: the guardrails that make the agent fail closed, and the observability that lets you prove what it did. Keep them in the picture so you never end up shipping the loop without them.

2. Earn the trust gap

A demo can be impressive and still not be shippable. Impressive tells you how it does when things go well; trust is about what it does when things go wrong. Three properties get you across the gap, and you can actually check each one, they are not just nice words. First, the agent fails closed: on an ambiguous or oversized diff it says “could not review” instead of guessing. Second, its credentials only let it comment, so even a fully compromised prompt cannot merge, push, or delete. Third, every action is logged with the version stamp from Chapter 14, so you can explain any review after the fact.

# the trust contract, enforced in code not vibes
from claude_agent_sdk import query, ClaudeAgentOptions, AssistantMessage, TextBlock

async def review(pr):
    if pr.diff_lines > MAX_DIFF or pr.is_binary_only:
        return comment("Could not review: diff too large / binary-only. "
                       "Flagging for human review.")        # fail closed
    options = ClaudeAgentOptions(
        system_prompt=CFG.contract,
        allowed_tools=["Read"],        # read-only: the ch11 guardrail, no writes or commands
        max_turns=6,
        cwd=pr.repo,
    )
    findings = ""
    async for message in query(prompt=f"Review PR #{pr.id}.", options=options):  # ch3 loop
        if isinstance(message, AssistantMessage):
            for block in message.content:
                if isinstance(block, TextBlock):
                    findings += block.text
    log.info("review", version=CFG.version, pr=pr.id)   # ch10: stamped with the version
    return post_comments(findings)     # token scope = comment-only, ch11

// the trust contract, enforced in code not vibes
import { query } from "@anthropic-ai/claude-agent-sdk";

async function review(pr) {
  if (pr.diffLines > MAX_DIFF || pr.isBinaryOnly) {
    return comment("Could not review: diff too large / binary-only. " +
                   "Flagging for human review.");           // fail closed
  }
  const options = {
    systemPrompt: CFG.contract,
    allowedTools: ["Read"],        // read-only: the ch11 guardrail, no writes or commands
    maxTurns: 6,
    cwd: pr.repo,
  };
  let findings = "";
  for await (const message of query({ prompt: `Review PR #${pr.id}.`, options })) {  // ch3 loop
    if (message.type === "assistant") {
      for (const block of message.message.content) {
        if (block.type === "text") findings += block.text;
      }
    }
  }
  log.info("review", { version: CFG.version, pr: pr.id });   // ch10: stamped with the version
  return postComments(findings);     // token scope = comment-only, ch11
}

A reviewer that can be wrong but cannot do damage is shippable. A reviewer that is usually right but occasionally destructive is not.

3. Package it as a skill

Now put the procedure into a structure Claude can discover and load. A skill is a self-contained folder whose entry point is SKILL.md: YAML frontmatter with two required fields, name (lowercase, hyphens, no “claude”/“anthropic”) and description (what it does and when to use it), followed by the instructions. Heavy or rarely-needed material goes in sibling files that load only when referenced. This is the progressive-disclosure model from the official docs: the description is always in context, the body loads when triggered, and reference files load on demand.

pr-reviewer/
├── SKILL.md            # entry: frontmatter + the review procedure
├── CHECKLIST.md        # what to check, by category (loaded when needed)
├── SECURITY.md         # the deeper security pass (the ch7 subagent's brief)
└── scripts/
    └── post_review.py  # deterministic comment posting (ch11 scope baked in)

# pr-reviewer/SKILL.md
---
name: pr-reviewer
description: Reviews a pull request diff for correctness, security, and clarity, then posts inline comments. Use when asked to review a PR, check a diff before merge, or audit changes for risky patterns.
---

# PR Reviewer

## When to stop
If the diff exceeds the size budget or is binary-only, post "could not review"
and flag for a human. Never guess.

## Procedure
1. Summarize the change in one sentence (see CHECKLIST.md for categories).
2. Run the correctness and clarity passes.
3. For auth, crypto, or input-handling changes, follow SECURITY.md.
4. Post findings with `scripts/post_review.py` (comment scope only).

Keep SKILL.md short and push the detail into CHECKLIST.md and SECURITY.md. A bloated entry file burns context budget every time the skill triggers, even when the task is a one-line typo fix.

4. Document the contract honestly

What makes people trust a tool is not a claim that it is perfect. It is being precise about where its limits are. Write down what the reviewer checks, what it deliberately ignores, and its known blind spots, so a user trusts it where it is strong and double-checks where it is weak.

## What this reviewer does and does not do
Checks:    correctness smells, obvious security anti-patterns, unclear naming,
           missing error handling, secrets committed by accident.
Ignores:   style/formatting (your linter owns that), architecture decisions,
           anything needing product context it cannot see.
Limits:    can miss subtle logic bugs; not a substitute for a human reviewer
           on security-critical code. Diffs over the size budget are skipped.

5. Publish and pin the release

Share the skill so others can install it. In Claude Code, skills are filesystem-based: a personal skill lives in ~/.claude/skills/, a project skill in .claude/skills/, and to share one you can go through the plugin marketplace (/plugin marketplace add <owner/repo> then /plugin install <name>@<marketplace>, per the official docs). Tag the published version to match the release id from Chapter 14, so “the skill someone installed” and “the config that produced a review” point at the same version.

# project-local install
$ cp -r pr-reviewer .claude/skills/

# or publish via a plugin marketplace repo, then teammates run:
$ # /plugin marketplace add your-org/agent-skills
$ # /plugin install pr-reviewer@your-org-skills

$ git tag pr-reviewer-2.4.0 && git push --tags   # same coordinate as ch14

Only use skills from trusted sources, and tell the people who install yours the same thing: the official docs are clear that a skill carries instructions and code, so it should be audited like any other dependency.

6. Hand it off with a runbook

Publish a skill with no operations plan and the first incident has no owner. Write a short runbook that someone who did not build it can follow when things are on fire: how to roll back, who to page, where the logs are, what the SLA promises.

## Runbook
Rollback:  agentctl release set-active pr-reviewer-<prev>   (serves in seconds)
Page:      #pr-reviewer-oncall ; owner: <name>
Logs:      observability dashboard -> filter by version stamp
SLA:       availability 99%, p95 <= 60s, eval floor 0.85 (see ch14)
Known-bad: if false-flag rate spikes, halt canary first, investigate second.

Learned: delivering on its ownXiaoman can now finish its own work. On a confusing or oversized diff it says "could not review" instead of guessing, it only has enough scope to post a comment, every action is logged with a version, and it is packaged as a skill with a contract and a runbook that a teammate can install in a clean environment and trust.

How to verify

Install the published skill in a clean checkout and run one review end to end; confirm it posts comments and does nothing else.
Feed it an oversized diff and confirm it fails closed with “could not review.”
Confirm every checklist line below is backed by evidence (a passing test, a log line, a config), not just a tick.
Hand the runbook to someone who did not build it and have them do a rollback on their own.

Learned: letting it go outYou can now install the published release in a clean checkout, run one review end to end, and confirm it posts comments only and fails closed on an oversized diff. Then hand the runbook to someone who did not build it for an unaided rollback, and Xiaoman is ready to be sent out on its own.

Why it works

Everything in this book was one idea seen from fourteen angles: an agent is a system, not a prompt. The prompt sets its contract, the loop drives it step by step, context and skills are the memory it works from, evals watch whether it is right, guardrails stop it from doing dangerous things, observability lets you reconstruct what it did, and versioning lets you back out when something breaks. Packaging is just making that system portable and the contract readable to the next person. What you ship is trust, not cleverness.

The closing checklist

One line per chapter, and you only get to tick it once you have evidence:

ch0 mindset Treat the agent as a system you operate, not a demo you show. [ ]
ch1 setup Reproducible dev loop; anyone can run the reviewer locally. [ ]
ch2 prompt as contract The system prompt states the job, the limits, and the stop conditions. [ ]
ch3 agent loop The loop terminates and handles the no-tool and error cases. [ ]
ch4 context Only the diff and the repo facts it needs enter context. [ ]
ch5 SKILL.md The review procedure lives in a skill, not a giant prompt. [ ]
ch6 hooks / slash A clean entry point triggers a review on PR open. [ ]
ch7 subagents The deep security pass runs as a focused subagent. [ ]
ch8 Codex / interop It plays well with the other tools in the dev flow. [ ]
ch9 evals The suite is green on the current release, with a pass-rate floor. [ ]
ch10 observability Every review is logged with its version stamp. [ ]
ch11 error handling + guardrails Fails closed; sandboxed; comment-only token. [ ]
ch12 cost Per-review token and dollar cost is tracked and bounded. [ ]
ch13 deployment Deployed behind a stable entry point with health checks. [ ]
ch14 versioning + SLA Pinned config, regression gate, canary, one-flag rollback, published SLA. [ ]

Recap

You shipped a PR reviewer people can trust, with versions, published: it fails closed, scopes its token to commenting, logs every action with a version, documents its own limits, installs as a skill, and hands off with a runbook. The craft was never one trick. It was prompt, loop, context, tools, evals, guardrails, and operations working together as one system. That is AgentCraft.

Common pitfalls

Shipping the demo. A demo shows how it does when things go well; trust is about how it does when things go wrong. Close the gap before you publish.
A bloated SKILL.md. Detail belongs in referenced files that load on demand, not in the entry file that triggers every time.
Undocumented scope. If users do not know the limits, they will trust it where they should not.
Skill and config drifting apart. Tag the published skill with the same id as the release that produced your reviews.
No runbook. Publishing without an operations plan leaves the first incident with no owner.

The whole map is lit. You no longer watch over it. The day you log off, it does not merely echo the way it did back in the Waking Nook; it wraps up the day's work on its own and leaves a note: see you tomorrow. Almost full, not yet full, and now, just full. One last thing: publish it, so others can raise a Xiaoman of their own.

0123456789101112131415

Just lit Full Bloom · 16 / 16 lit

Sources

Anthropic: Agent Skills · official
Anthropic: open-source Skills repository · official