The Forge turns a rough idea into shipped work. Lesson 8 walked its seven steps. This lesson is about who does the work once the contract exists — and the one rule that defines the whole method: everything runs away-from-keyboard (AFK), and the human's only job is to watch. You will meet the three-agent crew, see how a top model hands a unit down to any other agent over a one-shot command line, and learn exactly when — and only when — the loop is allowed to stop and ask you something.
Picture a kitchen during a dinner rush. There is a head chef at the pass who never touches a pan — they read the tickets, decide which cook gets which dish, and taste every plate before it leaves. There are line cooks, each making one dish at a time. And there is a second taster at the pass who checks each finished plate — and crucially, that taster is never the cook who made it. You, the owner, are out front. You do not cook. You do not taste. You read the board on the wall that says what is firing, what is plated, and what got sent back.
That is exactly the shape of an AFK Loop Engineering run. Three kinds of agent do all the work between them, and you — the human — are out front reading the board. The whole point of the method is that you never execute anything: not the build, not the check, not even the final quality pass. You observe. The system is built so that watching is enough.
The three roles are the Orchestrator (the head chef at the pass), the Executor (a line cook), and the Validator (the second taster). They are all model-agnostic — any of them can be any AI model behind any command line. What never changes is the division of labor: the one who decides never builds, the one who builds never signs off on their own work, and the human never picks up a pan.
The one line to remember: in an AFK run, every action is taken by an agent. The human's entire contribution is observability — reading what happened. The loop interrupts you for exactly one reason, and we will name it precisely below.
Think of it like… air-traffic control. Controllers route every plane, and a separate set of eyes confirms each landing — but the airline executives upstairs never touch a control tower mic. They read the departures board. If a genuinely novel decision comes up that only a human with authority can make, then a controller picks up the phone. Until that moment, the board is the whole job. Unlike a real control tower, here even the "second set of eyes" is another agent — the human is one more step removed.
The Forge front-end is seven steps: grill → research (optional) → prototype (optional) → PRD → issues → implement → review. This lesson zooms into step six, implement, and into the agents that carry steps six and seven. By the time implement begins, the contract already exists: a GOAL.md with a measurable done-when, and a board of issues (tickets) with explicit BLOCKING relationships — a kanban. Implement is the loop running over that board, AFK.
The execution model layered on top of the loop is multi-agent. The single inner loop you already know — LEARN → ANALYZE → EXECUTE one bounded unit → VERIFY at the real boundary → DECIDE — does not change. What changes is that the steps are distributed: the Orchestrator owns LEARN/ANALYZE/DECIDE and routing; an Executor owns EXECUTE for one unit; the Validator owns VERIFY for that unit, at the real boundary, and is never the agent that produced the artifact.
"AFK" is the operating constraint, not a nice-to-have. The system is designed so that no step requires the human to act. The human reads LOOP-LOG.md, review.md, and the run status. The single exception — a genuine user-only fork — is a structured handoff, decision-ready, covered in the last section. Everything else, including the QA review, is performed by agents.
Three roles split the work, and each carries one hard rule about what it must not do. The rules are what keep the system honest — they are not style, they are structure.
Orchestrator · the control plane
Decides and routes. Never builds.
Captures scope, compiles the contract, breaks work into bounded units, hands each unit to an Executor, and routes the result to a Validator. It owns the plan and the verdict — the "what next" — but it never writes the code or the doc itself.
Never: produces the artifact it is supposed to be coordinating.
Executor · the builder
Builds exactly one bounded unit.
Receives a single Unit Contract — one ticket, one done-when, the files it may touch — and produces the artifact: the code, the doc, the config. One unit at a time. It does not decide what to do next and it does not get to declare itself correct.
Never: signs off on its own work — that is the Validator's job.
Validator · the proof gate
Proves the result at the real boundary.
Takes the finished unit and the contract, then runs the actual check — the test, the build, the request against the live thing — and reports pass or fail with evidence. A claim is never enough; it must produce proof.
Never: is the same agent that built the unit. Builder ≠ checker, always.
The structural rule
Independence is not a preference, it is the load-bearing wall. If the Executor could grade itself, "done" would mean "the builder thinks it is done" — exactly the failure the whole loop exists to prevent. The Validator being a different agent is what turns "I think it works" into "here is the proof it works."
An Executor never gets the whole goal. It gets a Unit Contract: one ticket id, the single done-when condition that closes it, the explicit file/area scope it is allowed to touch, and the relevant slice of context. This is what makes "one bounded unit" enforceable — the Executor cannot wander, because its contract only describes one piece of work.
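To make "the contract only describes one piece of work" concrete, here is a minimal sketch of a Unit Contract as a plain record, plus a scope check an Orchestrator might run on the files an Executor touched. The field names (ticketId, doneWhen, scope, context), the file paths, and the helper outOfScope are all hypothetical; the lesson does not specify an on-disk format.

```javascript
// Hypothetical shape of a Unit Contract; field names and paths are illustrative,
// not taken from the Forge itself.
const contract = {
  ticketId: 'PRJ-114',
  doneWhen: 'a valid coupon reduces the total; an invalid one shows an error',
  scope: ['src/checkout/', 'test/checkout/'], // path prefixes the Executor may touch
  context: 'checkout form and totals calculation',
};

// Return the touched files that fall outside the contract's scope.
function outOfScope(unitContract, touchedFiles) {
  return touchedFiles.filter(
    (file) => !unitContract.scope.some((prefix) => file.startsWith(prefix))
  );
}

const strays = outOfScope(contract, [
  'src/checkout/coupon.js',
  'src/billing/invoice.js',
]);
console.log(strays); // files that wandered outside the unit's scope
```

An empty result means the Executor stayed inside its unit; anything else is a reason to reject the artifact before it ever reaches the Validator.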
Bounding the unit also bounds the blast radius if the Executor is wrong, and it is what lets the Orchestrator run several Executors in parallel on non-blocking tickets without them colliding. Tickets joined by a BLOCKING edge are serialized; independent tickets are fanned out.
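The serialize-or-fan-out rule can be sketched as a filter over the board. This assumes each unit carries a blockedBy list of ticket ids and a col field; both names, and the sample tickets, are hypothetical.

```javascript
// Hypothetical board: each unit lists the ids that BLOCK it.
const board = [
  { id: 'PRJ-113', blockedBy: [], col: 'closed' },
  { id: 'PRJ-114', blockedBy: ['PRJ-113'], col: 'queued' },
  { id: 'PRJ-115', blockedBy: [], col: 'queued' },
  { id: 'PRJ-116', blockedBy: ['PRJ-114'], col: 'queued' },
];

// A queued unit is ready when every unit blocking it is closed.
function readyUnits(units) {
  const closed = new Set(
    units.filter((u) => u.col === 'closed').map((u) => u.id)
  );
  return units
    .filter((u) => u.col === 'queued' && u.blockedBy.every((id) => closed.has(id)))
    .map((u) => u.id);
}

console.log(readyUnits(board)); // units that can be fanned out right now
```

Here PRJ-114 and PRJ-115 can run in parallel, while PRJ-116 waits until PRJ-114 closes.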
A model asked to both build and verify its own output is structurally biased toward declaring success — it has already committed to one interpretation of the task. Handing verification to a fresh agent that only sees the contract and the artifact removes that bias. The Validator runs the check at the real boundary (the Proof Gate) and emits evidence, not an opinion. The Orchestrator then DECIDEs on the evidence.
This is the same discipline you met in the VERIFY lesson, now made organizational: the separation of "who built it" from "who proved it" is enforced by using two different agents, not by trusting one agent to be impartial.
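One way to enforce "two different agents" rather than trust it is to make the Validator choice refuse the builder. A minimal sketch, assuming the roster is a simple list of agent names; pickValidator and its parameters are hypothetical.

```javascript
// Hypothetical roster; the agent names mirror the lesson's examples.
const roster = ['codex', 'claude', 'kimi'];

// Pick a Validator for a unit, refusing the agent that built it.
function pickValidator(agents, builder, preferred) {
  if (preferred && preferred !== builder && agents.includes(preferred)) {
    return preferred;
  }
  const candidate = agents.find((agent) => agent !== builder);
  if (!candidate) {
    throw new Error(`no independent validator available for builder "${builder}"`);
  }
  return candidate;
}

console.log(pickValidator(roster, 'codex', 'claude')); // preferred is independent
console.log(pickValidator(roster, 'claude', 'claude')); // falls back to another agent
```

With a one-agent roster the function throws, which is the right outcome: a run that cannot supply an independent checker should not start.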
Here are the three roles as a triangle. Read the arrows: the Orchestrator assigns work down to an Executor and routes the result across to a Validator; the Validator sends a verdict back up. The dashed line is the one path that does not exist — an Executor proving its own work. That missing arrow is the whole design.
A naive setup is a straight pipe: model builds, same model checks, human approves. That pipe has two corruptions — the builder grades itself, and the human is in the execution path. The triangle removes both. Verification is a separate vertex (a different agent), and the human is detached entirely, connected only by a read-only line. The control plane (Orchestrator) is the apex precisely because it must never be a worker — its job is to keep the two corruptions from creeping back in.
Here is the part that makes the crew cross-agent. The Orchestrator does not have to do every unit itself, and the agents do not all have to be the same model. A strong model sits at the top and delegates each unit down to whatever agent is best for it, by literally running that agent's command line in one-shot mode and reading what comes back.
"One-shot mode" means: you call the other agent's CLI with a single prompt, it does the unit, it prints a result, and it exits. No chat session to babysit. Many coding agents expose this as a -p ("print" / prompt) flag, and this lesson writes every call in that shape: claude -p "…", codex -p "…", kimi -p "…". The exact flag or subcommand for non-interactive mode differs between tools and versions, so check each CLI's help before wiring it in. That uniform one-shot shape is the wire between agents: the Orchestrator writes a prompt, sends it down the line, and gets text back. The roster (which agents are on call for this run) is declared at the very start, and it is agnostic by default: any capable agent can fill any seat.
Think of it like… a head chef who can phone any kitchen in the city. "Make me this one dish, here is the recipe and the constraints, send it back." Each kitchen speaks the same ordering protocol (the -p call), so the chef does not care which kitchen takes it. Unlike a phone call, the order and the reply are plain text the chef logs — so every hand-off is on the record.
The -p one-shot call lets the Orchestrator route any unit to any agent on the declared roster, then read the text back.
Cross-agent delegation is deliberately the lowest-common-denominator interface: a shell command with a prompt that runs to completion and prints a result. No long-lived session, no proprietary protocol. That is why the roster is agnostic: anything you can invoke as tool -p "<prompt>" from a shell can be an Executor or a Validator. Models reachable only through an OpenAI-compatible proxy (for example Minimax via cliproxyapi) still present the same one-shot shape.
The roster is declared up front so the run is reproducible and so the Orchestrator can route by fit: a cheap fast model for boilerplate units, a stronger model for the gnarly one, a different model again as the Validator so independence holds across vendors too. The Orchestrator captures each call's prompt and the returned text into the run log, which is what the human later reads.
```bash
# the Orchestrator delegating one unit, then a DIFFERENT agent proving it
codex -p "$(cat unit-PRJ-114.contract.md)" > out/PRJ-114.patch
claude -p "Validate PRJ-114 against done-when. Run the gate. Report PASS/FAIL + evidence." \
  > out/PRJ-114.verdict.md
# builder = codex, validator = claude; never the same agent
```
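The same hand-off can be sketched from inside an Orchestrator process: run the agent to completion, capture what it printed, and record the exchange. This is an assumption-laden sketch: oneShot and runLog are hypothetical names, and it assumes the target CLI accepts a prompt after -p and writes its result to standard output. Node's own binary happens to have a -p (print) flag, so it serves as a stand-in agent here.

```javascript
const { spawnSync } = require('node:child_process');

const runLog = []; // stand-in for the run log the human reads

// Run `<agent> -p "<prompt>"` to completion and record the exchange.
function oneShot(agent, prompt) {
  const result = spawnSync(agent, ['-p', prompt], { encoding: 'utf8' });
  if (result.error) throw result.error; // e.g. the CLI is not installed
  const entry = {
    agent,
    prompt,
    status: result.status,
    output: result.stdout.trim(),
  };
  runLog.push(entry);
  return entry;
}

// Stand-in call: `node -p "1 + 1"` evaluates the expression and prints it.
const reply = oneShot(process.execPath, '1 + 1');
console.log(reply.output);
```

Swapping process.execPath for a real agent binary is the only change needed, provided that agent really does take its prompt this way.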
Now follow a single unit through the crew: the Orchestrator dispatches the unit, an Executor builds it, a different Validator proves it at the real boundary, and the Orchestrator decides. A failed proof loops back to a rebuild, and a genuine user-only question peels off to the human.
The diamond is VERIFY at the real boundary, performed by the Validator. DECIDE is the Orchestrator reading the verdict: pass → close the unit and pick the next; fail → return the unit to an Executor with the failure evidence attached, then re-prove (proof is perishable, so it is re-run after any change); fork → the rare branch where the next move requires a human decision, which becomes a structured handoff. Note what is not here: a path where the Executor declares itself done, and a path where the human runs the check.
Zoom out from one unit to the whole run. The board below is the kanban of units, with four columns: Queued (waiting, maybe blocked by another ticket), Dispatched (an Executor is building it), Proving (a Validator is checking it at the real boundary), and Closed (proof passed). Each card shows the unit id, a who pill (which role currently holds it), and the agent assigned from the roster. Press a card's arrow to advance it one column — or drag it. The counts at the top stay honest.
This is the view a human would glance at to answer "how is the run going?" — without touching a single unit. You are reading the board, exactly as the method intends.
Think of it like… the order rail in the kitchen window. Tickets slide right: fired, cooking, plating, away. One glance tells the owner how slammed the line is — and the owner still never picks up a pan. Unlike a paper rail, the who pill also tells you which station (which agent) has each ticket right now.
Every unit is one object with a col field that can only be queued, dispatched, proving, or closed. The columns on screen are not the source of truth — the array of unit objects is. Moving a card just changes that one field and re-paints; the arrow button and a drag-and-drop drop both call the same moveTo(), so the two interactions can never disagree. This is the same shape as the issues board the Forge produces, minus the network.
The who pill is derived from the column: a unit in dispatched is held by an Executor, one in proving by a Validator, a closed one was signed off by the Orchestrator on the Validator's evidence. The column is the role — which is exactly why the same agent can never be both builder and checker for one unit.
```javascript
function moveTo(id, col) {
  const u = units.find(x => x.id === id);
  if (!u || u.col === col) return;
  u.col = col;   // single source of truth
  render();      // repaint every column from state
}
```
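Deriving the who pill from the column can be as small as a lookup. A sketch under assumptions: the text above only states the holders for dispatched, proving, and closed, so treating a queued unit as held by the Orchestrator is my assumption, and holderByColumn and whoHolds are hypothetical names.

```javascript
// Which role holds a unit, derived purely from its column.
const holderByColumn = {
  queued: 'Orchestrator',   // assumption: waiting units sit with the control plane
  dispatched: 'Executor',
  proving: 'Validator',
  closed: 'Orchestrator',   // signed off on the Validator's evidence
};

function whoHolds(unit) {
  const role = holderByColumn[unit.col];
  if (!role) throw new Error(`unknown column: ${unit.col}`);
  return role;
}

console.log(whoHolds({ id: 'PRJ-114', col: 'proving' }));
```

Because the role is computed rather than stored, there is no second field that could drift out of step with the column.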
Drive the lifecycle of one unit by hand. Press an event and watch the highlighted state move and the readout update. The buttons grey out the instant a move is not allowed from where you are — because the lifecycle is a fixed set of legal transitions. Notice you can never jump from built straight to closed: a unit must be proven by the Validator first. And notice that the only button a human ever presses lives behind the fork state — everywhere else, the agents drive.
Everything the sim does is driven by this table. The buttons read it to decide what to enable; pressing one looks up machine[state][event] and moves there. The illegal move you cannot make — BUILD → CLOSED — simply is not in the table, so it cannot be taken. That missing entry is the rule "a unit must be proven before it closes."
```javascript
const machine = {
  QUEUED: { dispatch: 'BUILD' },
  BUILD:  { submit: 'PROVE' },            // Executor → Validator
  PROVE:  { pass: 'CLOSED', fail: 'BUILD', // proof gate verdict
            fork: 'FORK' },               // rare: needs a human
  FORK:   { resolve: 'CLOSED' },          // human decision applied
  CLOSED: {}                              // terminal: proven & done
};
```
The run log the sim prints is a stand-in for LOOP-LOG.md: every event, who took it, and the resulting state. That log is the human's window into an AFK run.
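A sketch of how a table like this can drive both the transitions and the log. The table is repeated so the block runs on its own; step, allowed, and the log line format are hypothetical, not the sim's actual code.

```javascript
// Same transition table as the lesson's sim.
const machine = {
  QUEUED: { dispatch: 'BUILD' },
  BUILD: { submit: 'PROVE' },
  PROVE: { pass: 'CLOSED', fail: 'BUILD', fork: 'FORK' },
  FORK: { resolve: 'CLOSED' },
  CLOSED: {},
};

const log = []; // stand-in for LOOP-LOG.md

// Apply one event; an event missing from the table throws instead of moving the unit.
function step(state, event, actor) {
  const row = machine[state] || {};
  if (!Object.prototype.hasOwnProperty.call(row, event)) {
    throw new Error(`illegal transition: ${event} from ${state}`);
  }
  const next = row[event];
  log.push(`${actor} ${event}: ${state} -> ${next}`);
  return next;
}

// Events legal from a state; a UI could use this to decide which buttons to enable.
const allowed = (state) => Object.keys(machine[state] || {});

// One unit that fails its first proof, then passes.
let state = 'QUEUED';
state = step(state, 'dispatch', 'orch');
state = step(state, 'submit', 'exec');
state = step(state, 'fail', 'valid');
state = step(state, 'submit', 'exec');
state = step(state, 'pass', 'valid');
console.log(state, log.length);
```

Calling step('BUILD', 'pass', 'orch') throws, which is the "must be proven before it closes" rule expressed as a missing table entry.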
Not every capability is equally safe to hand an autonomous crew. So the run declares authorization tiers: switches for what the agents are allowed to do without a human. Crucially, some capabilities depend on others: you cannot let an agent push to a remote until it is allowed to commit, and you cannot grant production deploys until pushing is on. Turn a higher tier on while its prerequisite is off and the panel warns you that the grant won't take, exactly like a feature flag whose support is missing.
Flip the switches below. Turn on push_remote while commit_local is off and watch the warning slide in. The summary line always reads back the set of capabilities that are actually live. The one tier that is special is user_gated: it is the explicit "ask a human" switch, the only sanctioned path for the loop to involve you.
Think of it like… keys on a ring, where some doors are behind others. The key to the safe is useless until you are first allowed through the office door. A good keyring tells you up front: "this one needs the office key first."
Tiers are a flat boolean record; a second object lists each tier's prerequisites. After any toggle the editor checks each enabled tier against its requires list. A tier that is on but missing a prerequisite is unsatisfied — it draws the inline warning and is excluded from the "actually granted" set the summary reads back. This is the same dependency-resolution shape as the Forge's BLOCKING edges between tickets: you cannot honour the dependent until its blocker is satisfied.
The point for an AFK run: authorization is declared and machine-checked, so the crew can run unattended within exactly the envelope you set — and the human is involved only through the explicit user_gated tier, never as an accidental dependency of getting normal work done.
```javascript
const requires = {
  read_repo:    [],                // base tier
  commit_local: ['read_repo'],     // must read before it writes
  push_remote:  ['commit_local'],  // must commit before it pushes
  deploy_prod:  ['push_remote'],   // sharpest grant, deepest chain
  user_gated:   []                 // the explicit "ask a human" switch
};
```
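Here is a sketch of how the "actually granted" set could be computed from the switches. The requires table is repeated so the block is self-contained. Note one assumption: this version treats a prerequisite as met only if it is itself granted, which is slightly stricter than a one-level check of the requires list. The function names are hypothetical.

```javascript
const requires = {
  read_repo: [],
  commit_local: ['read_repo'],
  push_remote: ['commit_local'],
  deploy_prod: ['push_remote'],
  user_gated: [],
};

// A tier is granted when it is switched on and all its prerequisites are granted.
function isGranted(tier, flags) {
  return Boolean(flags[tier]) && requires[tier].every((dep) => isGranted(dep, flags));
}

// The set the summary line reads back.
function grantedTiers(flags) {
  return Object.keys(requires).filter((tier) => isGranted(tier, flags));
}

// Tiers that are on but will not take effect: these draw the warning.
function unsatisfiedTiers(flags) {
  return Object.keys(requires).filter(
    (tier) => flags[tier] && !isGranted(tier, flags)
  );
}

const flags = { read_repo: true, commit_local: false, push_remote: true, user_gated: true };
console.log(grantedTiers(flags));
console.log(unsatisfiedTiers(flags));
```

With commit_local off, push_remote is switched on but not granted, so it lands in the warning list rather than the live set.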
This is the thesis in one picture. The loop on the left is a closed cycle the agents run among themselves — learn, analyze, execute, verify, decide, around and around. The human is the figure on the right, outside the cycle, connected by a single read-only line to the artifacts the loop emits. The human reads; the human does not reach in.
The human is not in the execution path (does not build), not in the verification path (does not run the check), and not in the QA path (does not perform the review). Even the autoreview QA at step seven is an agent that emits review.md as an observability report — a thing to read, not a task to do. The human's read-only line touches three artifacts: LOOP-LOG.md (the running narrative), review.md (the QA pass), and the run status (which units are where). If the human wants to change direction, they change the contract (GOAL.md) — they still do not reach into a running turn.
So what does "reading the board" actually look like? Like this. The four tiles up top are the run's vital signs — how many units are proven, how fast the loop is turning, how many proofs are failing, and how much the crew is spending. Below them, a table lists each in-flight unit with a colored badge: proven, building, blocked, or fork (waiting on a human). Hit Refresh for a new reading, or turn on Live to watch it tick — the way you would glance at an overnight run over coffee.
This is the entire human job rendered as a screen: read the rollup, scan the rows, spot the one unit sitting in fork that wants a decision. You never click into a unit to do it — you read.
Run status — PRJ "checkout-v2" · AFK loop
orchestrator: top model · roster: codex / claude / kimi · reading LOOP-LOG.md
| Unit | State | Proof-fails | Last event |
|---|---|---|---|
A single array of unit objects drives both the table and the rollup pill. Each tick perturbs the metrics within realistic bounds and recomputes each unit's state, then re-derives the overall banner: any unit blocked → red, any in fork → "needs you", else healthy. The KPI deltas are coloured by meaning, not arrow direction — a falling cadence is good (green) even though the arrow points down; a rising proof-fail rate is bad (rust) even though its arrow points up.
This is the readable face of LOOP-LOG.md + the run status. In a real run these numbers come from the log the Orchestrator writes as it dispatches and the Validators report. The "fork" badge is the single cell that ever calls for a human — everything else is just there to be read.
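The banner and the delta colouring described above can be sketched in a few lines. The metric names (cadence, proofFailRate, spend, proven) and the idea that spend is a lower-is-better metric are assumptions for illustration; only the blocked/fork/healthy precedence and the cadence and proof-fail examples come from the text.

```javascript
// Derive the overall banner from unit states; blocked takes precedence over fork.
function rollup(units) {
  if (units.some((u) => u.state === 'blocked')) return 'red';
  if (units.some((u) => u.state === 'fork')) return 'needs you';
  return 'healthy';
}

// Colour a KPI delta by meaning: for some metrics, lower is better.
const lowerIsBetter = new Set(['cadence', 'proofFailRate', 'spend']);

function deltaColour(metric, delta) {
  if (delta === 0) return 'neutral';
  const improved = lowerIsBetter.has(metric) ? delta < 0 : delta > 0;
  return improved ? 'green' : 'rust';
}

const units = [
  { id: 'PRJ-114', state: 'proven' },
  { id: 'PRJ-115', state: 'fork' },
];
console.log(rollup(units), deltaColour('cadence', -2), deltaColour('proofFailRate', 1));
```

Keeping the "which direction is good" knowledge in one set means the arrow and the colour are computed separately and cannot be confused.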
If the human never executes, when does the loop involve you? For exactly one reason: the next move is a genuine user-only decision, something no agent has the authority or the information to settle. A new legal commitment. Spending real money past a cap. Choosing between two valid product directions. Picking which of two irreversible paths to take.
When that happens, the loop does not silently guess and it does not silently stall. It packages a handoff: a short, decision-ready note that states the fork, the options, the trade-offs, and a recommendation — then it waits. You read it, you decide, you hand the decision back, and the crew resumes. That is the only sanctioned interruption, and it is still observability-shaped: you are reading a prepared decision, not doing the work.
Think of it like… a surgical team that pauses to call the family only for a true consent decision — never to ask which clamp to use. Everything routine they handle; the one irreducibly-human call, they tee up cleanly and wait for. Unlike the operating room, here the pause is logged and the recommendation written down, so the decision is fast and on the record.
A fork escalates to the human only if it is both irreducible (no proof, no policy, and no contract clause can settle it) and authority-bound (it needs a human's mandate — money, legal, or a values call). Anything an agent can settle with evidence or by re-reading GOAL.md is handled in the loop. The handoff is deliberately decision-ready: it does the analysis for you and ends with a recommendation, so your part is a yes/no/choose, not a research project. This keeps the interruption rare and short — and it keeps the human's role observability-shaped even at the one moment they act.
Operationally, the escalation rides the user_gated tier from the previous section: if that switch is off, the loop must not invent an interruption; it records the blocker and stops the affected branch instead, leaving the rest of the board running.
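The escalation rule reads naturally as a small routing function. This is a sketch, not the skill's implementation: routeFork, the action names, and the fields on the fork object (irreducible, authorityBound, question, options, tradeOffs, recommendation) are all hypothetical.

```javascript
// Decide what to do with a fork, given the run's authorization tiers.
function routeFork(fork, tiers) {
  const mustEscalate = fork.irreducible && fork.authorityBound;
  if (!mustEscalate) {
    return { action: 'resolve-in-loop' }; // evidence or GOAL.md can settle it
  }
  if (!tiers.user_gated) {
    // No sanctioned path to the human: record the blocker, stop only this branch.
    return { action: 'block-branch', note: fork.question };
  }
  return {
    action: 'handoff',
    handoff: {
      fork: fork.question,
      options: fork.options,
      tradeOffs: fork.tradeOffs,
      recommendation: fork.recommendation,
    },
  };
}

const spendFork = {
  irreducible: true,
  authorityBound: true,
  question: 'Exceed the spend cap to finish the run?',
  options: ['raise cap', 'stop here'],
  tradeOffs: 'raising the cap finishes tonight; stopping defers two units',
  recommendation: 'stop here',
};
console.log(routeFork(spendFork, { user_gated: true }).action);
```

The handoff object carries everything the human needs to answer in one read, which is what keeps the interruption short.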
None of this is magic. The crew is a handful of shell calls and a declared roster. Here is the shape of an AFK implement run: the Orchestrator reads the contract, loops over ready units, dispatches each to an Executor via that agent's -p call, sends the result to a different agent to prove, and writes everything to the log the human reads.
```bash
#!/usr/bin/env bash
# unit_contract, ready_units and decide are helpers assumed to be defined elsewhere

# roster declared up front, agnostic by default
EXECUTORS=("codex" "kimi")   # build agents
VALIDATOR="claude"           # proof agent, NEVER one of the executors
mkdir -p out

# Units arrive on fd 3 so an agent that reads stdin cannot swallow the queue
while read -r unit <&3; do
  # LEARN/ANALYZE: next ready (unblocked) unit
  agent="${EXECUTORS[$((RANDOM % ${#EXECUTORS[@]}))]}"

  # EXECUTE: dispatch one bounded unit, one-shot
  "$agent" -p "$(unit_contract "$unit")" > "out/$unit.artifact"

  # VERIFY: a DIFFERENT agent proves it at the real boundary
  "$VALIDATOR" -p "Prove $unit vs done-when. Run the gate. PASS/FAIL + evidence." \
    > "out/$unit.verdict"

  # DECIDE: read the verdict, log it, close or requeue
  decide "$unit" >> LOOP-LOG.md
done 3< <(ready_units GOAL.md)

# review: an independent QA agent emits an observability report
"$VALIDATOR" -p "QA the whole change vs GOAL.md. Emit review.md." > review.md
```
The exact orchestration lives in the loop-engineering skill. To find the AFK and cross-agent rules and the forge flow:
```bash
# the AFK + multi-agent contract and the 7-step forge flow
cat ~/.claude/skills/loop-engineering/forge-flow.md
grep -rn "cli -p\|Validator\|Orchestrator\|AFK\|observability" \
  ~/.claude/skills/loop-engineering/
```
The three invariants this run shape relies on: (1) the Validator is never in EXECUTORS; (2) every unit passes through a proof before decide can close it; (3) the only write the human makes is to GOAL.md, never into a running turn. The script guarantees (2) by ordering; (1) holds by how you declare the roster, and (3) by how you operate the run. The roster is the one knob you set per run; everything else follows from the contract.
Tie it together with one unit on the "checkout-v2" run, ticket PRJ-114 · add a coupon field to checkout. Watch where each role acts and where you do not.
1. The Orchestrator sees that PRJ-114 is unblocked, and writes its Unit Contract: the one done-when ("a valid coupon reduces the total; an invalid one shows an error"), the files it may touch, the context slice.
2. The Orchestrator dispatches the unit to an Executor with codex -p "<contract>". Codex writes the field, the validation, and a test, then prints a patch and exits.
3. The Orchestrator routes the patch to the Validator with claude -p "Prove PRJ-114 at the real boundary…". Claude applies the patch, runs the checkout against a real coupon and a bad one, and reports FAIL: the invalid case throws instead of showing an error, with the stack trace as evidence.
4. The Orchestrator returns PRJ-114 to an Executor with the evidence attached. Codex fixes the error path; the Validator re-proves (proof is perishable) and now reports PASS with both cases passing.
5. The Orchestrator closes PRJ-114, appends the whole trace to LOOP-LOG.md, and picks the next unit.
6. You read the log: PRJ-114 went red once and then closed green. You did nothing, and nothing needed you. No user-only fork came up, so the loop never asked.
Count the human actions in that example: zero. That is a healthy AFK turn. The only thing that would have pulled you in is a genuine user-only fork, and there wasn't one.
```
# LOOP-LOG.md: PRJ-114
10:02  orch   ANALYZE  PRJ-114 ready (no blockers) · contract written
10:02  exec   EXECUTE  codex -p → out/PRJ-114.patch
10:05  valid  VERIFY   claude -p → FAIL: invalid-coupon path throws (trace attached)
10:05  orch   DECIDE   requeue PRJ-114 with evidence
10:07  exec   EXECUTE  codex -p → out/PRJ-114.patch (v2)
10:09  valid  VERIFY   claude -p → PASS: both cases green
10:09  orch   DECIDE   close PRJ-114 · next: PRJ-115
```
Note the alternating exec/valid actors and that VERIFY never shares an agent with the EXECUTE above it. The human's name appears nowhere in this trace — which is the success condition.
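Both observations can be checked mechanically over a parsed trace. A sketch, assuming the log has already been parsed into objects; the entry shape (actor, step, agent) and the function independenceViolations are hypothetical, while the entries themselves mirror the PRJ-114 trace.

```javascript
// Trace entries mirroring the PRJ-114 log; the object shape is hypothetical.
const trace = [
  { actor: 'orch', step: 'ANALYZE', agent: null },
  { actor: 'exec', step: 'EXECUTE', agent: 'codex' },
  { actor: 'valid', step: 'VERIFY', agent: 'claude' },
  { actor: 'orch', step: 'DECIDE', agent: null },
  { actor: 'exec', step: 'EXECUTE', agent: 'codex' },
  { actor: 'valid', step: 'VERIFY', agent: 'claude' },
  { actor: 'orch', step: 'DECIDE', agent: null },
];

// Return the indexes of VERIFY entries that have no preceding EXECUTE,
// or that reuse the agent of the most recent EXECUTE.
function independenceViolations(entries) {
  const violations = [];
  let lastBuilder = null;
  entries.forEach((entry, i) => {
    if (entry.step === 'EXECUTE') lastBuilder = entry.agent;
    if (entry.step === 'VERIFY') {
      if (lastBuilder === null || entry.agent === lastBuilder) violations.push(i);
    }
  });
  return violations;
}

const humanActions = trace.filter((e) => e.actor === 'human').length;
console.log(independenceViolations(trace), humanActions);
```

An empty violations list and a human-action count of zero are the two things the paragraph above asks you to notice by eye.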
Three quick questions to check your understanding.
In an AFK run, what is the human's role?
Why must the Validator be a different agent than the Executor?
What does the Orchestrator do, and not do?