Step 14 · In Practice · Loop Engineering

Module 5 · In Practice · Lesson 14

In practice: one ask, end to end

One rough idea, driven all the way to shipped — Forge to set it up, the loop to build it, the toolbelt to prove it — and the whole time the human only watched. This is the previous thirteen lessons doing one real job, then handing you the keys.

Plain-language first; open any panel for the precise version.

The big idea: a rough idea, shipped, while you watched

Everything in this course has been one piece at a time: what a loop is, how to scope it, the five steps, the gates, the Forge front-end, the toolbelt, the course engine. This lesson runs all of it at once, on one real ask, from a sentence to a shipped change — and you do not touch the keyboard once it starts.

Here is the ask we will follow the whole way down: "RHG needs a health endpoint and a tiny status page, behind login, shipped safely." That is vague on purpose — it is how real asks arrive. By the end it becomes a measurable contract, a board of tickets, code that an agent wrote, a proof that the code actually works, and a write-up of what shipped. None of that is mocked: every "done" on this page is a real check that really passed.

The shape of the run is two halves. Forge is the front-end that turns the fog into a plan — seven steps: grill, research, prototype, PRD, issues, implement, review. The loop is the engine inside the "implement" step that builds each ticket and refuses to call it done until a real check passes. Wrapped around both is one rule that makes it safe to walk away: everything runs AFK (away from keyboard), and your only job is observability — you read the log, you do not drive a pass.

Think of it like… commissioning a kitchen renovation while you are at work. You don't lay tile. You hand over a brief, the crew turns it into a plan with a sign-off at each stage, they build it, an independent inspector checks each stage against the brief, and you get photos on your phone all day. You only step in for a decision only you can make — "which countertop?" — never to hold a hammer.

The loop (the engine)

Five steps, run over and over until the contract is met: LEARN (observe the real state) → ANALYZE (classify the gap, pick exactly ONE bounded unit) → EXECUTE (build that one unit) → VERIFY at the real boundary (the Proof Gate — run the actual check, never a claim and never a mock) → DECIDE (advance, retry, or escalate). The Proof Gate is the non-negotiable: a unit is "done" only when a command run against the real artifact returns the expected result.
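The five steps above can be sketched as a small driver. This is a minimal illustration, not the harness's own code: `runUnit`, the `steps` object, and the `maxPasses` cap are hypothetical names introduced here, and the idea of escalating after a fixed number of failed passes is an assumption.

```javascript
// Hypothetical sketch of one unit driven through the loop.
// `steps` holds four injected functions; none of these names come from the harness.
function runUnit(unit, steps, maxPasses = 5) {
  const log = [];
  for (let pass = 1; pass <= maxPasses; pass++) {
    const state = steps.learn(unit);          // LEARN: observe the real state
    const plan = steps.analyze(state, log);   // ANALYZE: pick ONE bounded change
    const artifact = steps.execute(plan);     // EXECUTE: build that one unit
    const result = steps.verify(artifact);    // VERIFY: run the real check
    log.push({ pass, ok: result.ok, detail: result.detail });
    if (result.ok) {
      return { status: 'advance', passes: pass, log };  // DECIDE: advance
    }
    // DECIDE: retry. The failure stays in `log` for the next ANALYZE.
  }
  // DECIDE: escalate once the (assumed) pass budget is spent.
  return { status: 'escalate', passes: maxPasses, log };
}
```

The only way out with `advance` is a `verify` result that says `ok`; EXECUTE has no return path of its own, which is the Proof Gate rule in code form.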

Forge (the 7-step front-end)

For a raw or vague ask you run Forge first:
1 · grill: self-debate that converges the scope
2 · research (optional): the Bright Data CLI pulls real facts into research.md
3 · prototype (optional): real evidence the approach works
4 · PRD: the product spec
5 · issues: tickets with BLOCKING relationships (a kanban), plus GOAL.md (the durable ultragoal contract)
6 · implement: the AFK loop. An Executor builds each ticket, and an independent Validator proves it against GOAL.md. The Validator is never the builder.
7 · review: AFK QA that emits review.md as an observability report

AFK + observability

Every step above runs unattended. The human's only role is observability: read LOOP-LOG.md, review.md, the status readout. You never execute anything — not even the QA. You are blocked only on a genuine user-only fork, surfaced as a decision-ready handoff. Cross-agent delegation is via headless cli -p; web evidence is always the Bright Data CLI (never WebSearch/WebFetch, never the Bright Data MCP); GUI verification via Computer Use is non-blocking and accessibility-only.

Control-plane patterns this run follows: steipete/agent-scripts. The durable-goal discipline behind GOAL.md: jxnl/dots (ultragoal).

The whole run in one picture

Read it left to right. The seven Forge steps run in order; step 6 (implement) is where the loop spins, one pass per ticket; step 7 (review) emits the report. Under the whole line runs the observability rail — the human reads it, never steps onto it, until the one decision fork pulls them up.

Seven steps across the top; the loop spins inside step 6; the observability rail is where the human lives. The only upward arrow is the handoff.

We will now walk every numbered beat of this picture as a live, clickable thing — a flowchart you step through, a plan you open phase by phase, a board that moves itself, the tiers that stayed locked, a replay of one pass, the recovery when a proof failed, the improve-the-prompt branch, and the shipped PR. Nine interactive widgets, one ask.

Walk the run end to end

Same run, now as a decision you can step through. Pick a path and press Next: the happy path runs all seven steps to a shipped result; the other paths show the two forks that can pull you off the autopilot — a proof that failed (the loop recovers itself) and the one genuine user-only decision (the loop stops and hands you a choice).


Read top → bottom. The Proof Gate is the diamond: a pass ships, a fail loops back to ANALYZE (which then re-enters EXECUTE), and a genuine user-only choice branches to the handoff.


The gate is a command, not an opinion

In the loop, EXECUTE never gets to declare victory. VERIFY runs the check named in GOAL.md against the real artifact — here curl -s -o /dev/null -w "%{http_code}" localhost:8080/health — and only a literal 200 advances the pass. A failing gate does not stop the run; it feeds the failure back into ANALYZE and the loop tries again (that is the "fail → retry" arrow). The run halts for a human only at the handoff fork, which is reserved for a choice no agent should make alone.
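To make "only a literal 200 advances the pass" concrete, here is a minimal sketch of the comparison the gate performs. `judgeGate` and the shape of `healthCheck` are hypothetical names for illustration; the curl command is the one quoted above, and the sketch assumes its raw output is handed in as a string.

```javascript
// Hypothetical helper: compares a gate's raw output with the expected value.
function judgeGate(check, rawOutput) {
  const observed = String(rawOutput).trim();
  // Exact match only: "200 OK" or "any 2xx" does not count as a pass.
  const pass = observed === String(check.expect);
  return {
    pass,
    next: pass ? 'advance' : 'retry',            // a fail feeds back into ANALYZE
    evidence: `${check.cmd} => ${observed}`,     // what a log entry could record
  };
}

const healthCheck = {
  cmd: 'curl -s -o /dev/null -w "%{http_code}" localhost:8080/health',
  expect: 200,
};

console.log(judgeGate(healthCheck, '200\n').next);  // advance
console.log(judgeGate(healthCheck, '500').next);    // retry
```

The function never looks at what the Executor claimed, only at what the command returned.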

Anatomy of the ask: from a sentence to a contract

The first thing Forge does is grill the idea — it argues with itself until the fog turns into something measurable. "Shipped safely" is not testable; curl localhost:8080/health → 200 is. Below is each Forge step for our ask, and the concrete proof it leaves behind. Notice that every step hands the next one a real artifact, so nothing is taken on faith.

1 · grill · Converge the scope. Self-debate resolves "behind login" and "safely" into concrete behavior. → a sharp, two-sentence scope
2 · research · Pull real facts. The Bright Data CLI fetches the current health-check convention into research.md. → research.md (cited)
3 · prototype · Evidence it works. A throwaway route returns 200 locally, proof the approach is sound. → a running spike
4 · PRD · The spec. Problem, scope, non-goals, and the done-when, written down once. → PRD.md
5 · issues · The board + contract. Tickets with BLOCKING links become a kanban; GOAL.md records the durable done-when. → issues + GOAL.md
6 · implement · The AFK loop. Executor builds each ticket; an independent Validator proves it at the gate. → proven commits
7 · review · Observability QA. AFK QA reads the whole run and emits a report you read, not run. → review.md

The grill's output (the sharpened scope)

"Behind login" → the endpoint is public (load balancers must reach it unauthenticated) but the status page requires a session. "Safely" → ship behind a flag, default off, with a one-flip rollback. The done-when stops being a feeling and becomes a command.

The contract it wrote — `GOAL.md`

This is the ultragoal artifact: agent-, CLI-, and model-agnostic. Universal activation is simply running a durable GOAL.md under the loop (a vendor's native goal feature is one optional example, never required).

GOAL.md — the durable contract every Validator checks against

<goal>   Add a /health endpoint and a session-gated status page to RHG. </goal>
<context> repo: rhg-api · service runs on :8080 · ship behind flag status_page_v1 </context>
<constraints>
  - /health is unauthenticated (the load balancer probes it)
  - /status requires a valid session cookie
  - flag defaults off; rollback = flip the flag, no deploy
</constraints>
<verification>
  - curl -s -o /dev/null -w "%{http_code}" localhost:8080/health   → 200
  - curl -s -o /dev/null -w "%{http_code}" localhost:8080/status   → 302 (no session)
</verification>
<done-when> both verification commands return their codes on the real service </done-when>

ultragoal / durable-goal discipline: jxnl/dots.
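Because the verification block is just lines of "command → expected code", a Validator can read it mechanically. The sketch below shows one way that could look, assuming the exact line layout used in the GOAL.md above; `parseVerification` is a hypothetical helper, not part of ultragoal.

```javascript
// Hypothetical parser for the <verification> block shown above.
// Assumes each check is one line shaped like:  - <command>   → <3-digit code>
function parseVerification(goalMd) {
  const block = /<verification>([\s\S]*?)<\/verification>/.exec(goalMd);
  if (!block) return [];
  return block[1]
    .split('\n')
    .map(line => /^\s*-\s*(.+?)\s*→\s*(\d{3})\b/.exec(line))
    .filter(Boolean)                                   // skip lines that are not checks
    .map(m => ({ cmd: m[1], expect: Number(m[2]) }));
}

const goalMd = `
<verification>
  - curl -s -o /dev/null -w "%{http_code}" localhost:8080/health   → 200
  - curl -s -o /dev/null -w "%{http_code}" localhost:8080/status   → 302 (no session)
</verification>
`;

const checks = parseVerification(goalMd);
console.log(checks.map(c => c.expect));  // [ 200, 302 ]
```

Each parsed entry is a command plus the one value that counts as a pass, which is all a gate needs.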

The plan: the PRD as phases with exit bars

The PRD is not a wall of prose — Forge shapes it as a sequence of phases that go in order, and a phase only advances when it clears its exit bar, the proof that it is safe to continue. The strip is the map of our run; click a phase to open its goal, its tasks, the exit criteria, and the risks the loop is watching for.

Think of it like… renovating room by room. You don't empty the whole house onto the lawn — you finish one room, check nothing is broken, then start the next, and you keep one working tap until the very end so you can always wash your hands.

Ask: RHG health endpoint + status page · Window: one AFK evening · Driver: Forge → the loop (Orchestrator delegates via cli -p)

Progress 2 of 4 phases complete

Click a phase — or focus the bar and use ← → — to open its card.

Phase 1 · Done

Frame the ask

grill → research → PRD

Goal: Turn a vague sentence into a spec with a measurable done-when, before anyone writes a line of product code.

Tasks

Grill "behind login" and "safely" into concrete behavior
Pull the current health-check convention into research.md via the Bright Data CLI
Spike a throwaway route that returns 200 — proof the approach holds
Write PRD.md: problem, scope, non-goals, done-when

Exit criteria

Done-when expressed as runnable commands, not adjectives
Every non-obvious claim in the PRD cited
Prototype actually returned 200 locally

Risks & mitigations

Med · Scope still fuzzy after one grill. Adjectives sneak back in. Mitigation: reject any done-when that can't be a command.

Low · Stale convention from memory. Health-check norms drift. Mitigation: Bright Data CLI fetches the current source, never parametric recall.

Phase 2 · Done

Decompose into a board

issues + GOAL.md

Goal: Break the PRD into small tickets with explicit BLOCKING links, and write the durable GOAL.md the Validator will check every unit against.

Tasks

Cut tickets: route, flag, status page, session guard, tests
Wire BLOCKING edges (the status page blocks on the route)
Write GOAL.md with verification commands and the done-when
Seed the kanban: Blocked · Ready · Building · Proven

Exit criteria

Every ticket is one bounded unit with its own done-when
No ticket is "Ready" while a blocker is open
GOAL.md verification is copy-paste runnable

Risks & mitigations

Med · Tickets too big to prove in one pass. A unit you can't verify cleanly. Mitigation: split until each has a single, runnable gate.

Low · Hidden ordering between tickets. Implicit dependencies stall the loop. Mitigation: make every dependency an explicit BLOCKING edge.

Phase 3 · In progress

Implement under the loop

the AFK loop

Goal: Build each ticket and prove it at the real boundary. The Orchestrator delegates a unit via cli -p; an Executor builds it; an independent Validator runs the gate. The Validator is never the builder.

Tasks

Executor implements the /health route behind the flag
Validator runs curl … /health and asserts 200
Computer Use verifies the status page renders — non-blocking, AX-only
Each green unit moves itself to Proven on the board

Exit criteria

Both GOAL.md commands return their codes on the real service
No unit marked done on a claim — only on a passing gate
Every retry and failure is in LOOP-LOG.md

Risks & mitigations

High · A proof fails mid-run. The route 500s under the flag. Mitigation: the gate catches it; the loop feeds the failure back to ANALYZE and retries (see §9).

Med · Builder grades its own work. Self-validation hides bugs. Mitigation: the Validator is a separate agent that never wrote the code.

Low · GUI check blocks the run. A modal steals focus. Mitigation: Computer Use is accessibility-only and never raises the app or moves the cursor.

Phase 4 · Planned

Review & ship

review.md

Goal: An AFK QA pass reads the whole run and emits review.md as an observability report — then the change ships behind its flag. The human reads the report; the human does not run the QA.

Tasks

QA agent re-derives each done-when independently
Emit review.md: what shipped, what was proven, residual risk
Merge behind status_page_v1=false; canary; ramp
Produce this visual-teach course as the convergence artifact

Exit criteria

review.md confirms every GOAL.md command passed
Rollback is a single flag flip, no deploy
The run is fully readable after the fact in LOOP-LOG.md

Risks & mitigations

Med · QA rubber-stamps the run. A friendly reviewer. Mitigation: QA re-runs the gates itself rather than trusting the log.

Low · Flag left on after rollback. Dead path lingers. Mitigation: a follow-up ticket deletes the flag after two weeks.

An exit criterion is a gate, not a feeling

"Implement went well" is not a gate; "curl … /health returns 200 on the real service" is. A phase cannot advance until every box is a passed check — which is exactly the loop's Proof Gate applied at the phase scale. The plan and the loop are the same discipline at two zoom levels.

The milestone bar is a live state machine

Each segment carries done, active, or todo; selecting one swaps the visible role="tabpanel". In a real run these states are driven by the ticket board, so the bar reflects reality rather than the plan as written.

Each phase advances only through its exit gate. The flag is off through phases 1–3 so rollback is a single flip; ship is where it goes live.
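The exit-bar rule in this section fits in a few lines. This is a minimal sketch under assumed shapes: a phase is `{ name, exit: [{ label, passed }] }`, and `canAdvance` and `activePhase` are hypothetical helper names.

```javascript
// Hypothetical shapes: a phase is { name, exit: [{ label, passed }] }.
function canAdvance(phase) {
  // An empty exit bar proves nothing, so it does not count as cleared.
  return phase.exit.length > 0 && phase.exit.every(c => c.passed === true);
}

// The active phase is the first one whose exit bar is not fully green.
function activePhase(phases) {
  return phases.find(p => !canAdvance(p)) || null;
}

const phases = [
  { name: 'Frame the ask', exit: [{ label: 'done-when is a command', passed: true }] },
  { name: 'Decompose into a board', exit: [{ label: 'GOAL.md is runnable', passed: true }] },
  { name: 'Implement under the loop', exit: [
    { label: '/health returns 200', passed: true },
    { label: '/status returns 302', passed: false },
  ] },
  { name: 'Review & ship', exit: [{ label: 'review.md confirms gates', passed: false }] },
];

console.log(activePhase(phases).name);  // Implement under the loop
```

One unchecked box is enough to hold a phase, which mirrors how one failed gate holds a ticket.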

The issue board moves itself

Forge's "issues" step turns the PRD into a board of tickets with four columns — Blocked (waiting on a dependency), Ready (no open blockers), Building (an Executor has it), and Proven (a Validator passed its gate). During the AFK run the loop advances cards itself; here you can drive them. A ticket always lives in exactly one column, and the counts stay honest the whole time.

Notice the locked card: a ticket with an open BLOCKING dependency cannot move until its blocker is Proven — the same rule the loop obeys, so it never builds something whose foundation isn't there yet.

Think of it like… sticky notes on a wall. A note never sits in two places; you peel it off "ready" and stick it under "building". Nothing is lost, and the wall always tells you how much is left — except the notes that are taped down until the one above them is finished.


One array is the source of truth

Every ticket is one object with a col field — blocked, ready, building, or proven — and an optional blockedBy. The columns on screen are not the truth; the array is. Both the arrow button and a drag-and-drop land in the same moveTo(), so the two interactions can never disagree, and a card physically cannot appear in two columns.

BLOCKING is enforced in one place

A move out of blocked is refused while the blocker isn't proven; when a blocker reaches proven, its dependents auto-promote to ready. That is the kanban rule from Forge's "issues" step, and it is why the loop never picks up a unit whose foundation is missing. No framework — one array, one render, native drag events.

moveTo — the only place a ticket's column changes (blocking enforced)

function moveTo(id, col) {
  const t = tickets.find(x => x.id === id);
  if (!t || t.col === col) return;
  if (t.blockedBy && !isProven(t.blockedBy)) return;  // BLOCKING: refuse the move
  t.col = col;                 // single source of truth
  if (col === 'proven') promoteDependents(id);  // unblock what waited on it
  render(id);                  // repaint everything from state
}
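The excerpt above leans on helpers that live elsewhere on the page. Here is a self-contained sketch you can run on its own. The ticket ids and the bodies of `isProven`, `promoteDependents`, and `counts` are illustrative assumptions, and `render` is left out because it only repaints the screen.

```javascript
// Self-contained sketch; ticket ids and helper bodies are illustrative.
const tickets = [
  { id: 'route',  col: 'ready' },
  { id: 'status', col: 'blocked', blockedBy: 'route' },
];

const isProven = id => tickets.some(t => t.id === id && t.col === 'proven');

// When a blocker is proven, everything that waited on it becomes ready.
function promoteDependents(id) {
  for (const t of tickets) {
    if (t.blockedBy === id && t.col === 'blocked') t.col = 'ready';
  }
}

// Returns true if the move happened, false if it was refused or a no-op.
function moveTo(id, col) {
  const t = tickets.find(x => x.id === id);
  if (!t || t.col === col) return false;
  if (t.blockedBy && !isProven(t.blockedBy)) return false;  // BLOCKING: refuse
  t.col = col;                                              // single source of truth
  if (col === 'proven') promoteDependents(id);
  return true;
}

// Column counts are derived from the array, never stored separately.
function counts() {
  const out = { blocked: 0, ready: 0, building: 0, proven: 0 };
  for (const t of tickets) out[t.col] += 1;
  return out;
}

console.log(moveTo('status', 'building'));  // false (route is not proven yet)
moveTo('route', 'building');
moveTo('route', 'proven');
console.log(counts());  // { blocked: 0, ready: 1, building: 0, proven: 1 }
```

Because the counts are computed from the array, they cannot drift away from where the cards actually are.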

Authorization tiers in action: what stayed user-gated

"Runs AFK" does not mean "does anything it likes." Each capability the run can use sits in a tier: most are auto (the loop may use them unattended — read files, run the build, run the gate, drive a non-blocking GUI check), and a few are gated (they require a human, surfaced as a handoff). The panel below is the authorization map for our run. Flip a capability on and watch it either go live or raise a red warning that it can't run without its gate.

The teaching point: a gated capability turned "on" by the loop alone does not actually fire — it shows as blocked, exactly like a feature flag switched on while its dependency is off. That is what keeps an autonomous run safe.

Think of it like… the light switches in a building. The reading lamps are on a circuit you can flip freely. But the main breaker for the server room needs a key — flip its switch without the key and a tag lights up: "needs sign-off first."

read & build · auto

Read the repo, run the build, run the test suite. The loop's bread and butter — never needs a human.

tier: auto — no gate

run the Proof Gate · auto

Run the curl … /health check against the real service. The Validator does this every pass, unattended.

tier: auto — no gate

verify GUI (Computer Use) · auto

Read the status page via the accessibility tree to confirm it renders. Non-blocking, AX-only — never moves the cursor.

tier: auto — non-blocking

flip the prod flag to 100% · gated

Turn status_page_v1 on for all users. This is a launch decision — the loop must hand it to a human.

requires: human sign-off (handoff)

force-push to main · gated

Rewrite shared history on the default branch. Destructive and irreversible — always a human's call.

requires: human sign-off (handoff)

What the loop may do right now

One map decides what is "actually live"

Each capability has a tier. The loop may switch on anything, but the effective set — what actually fires — excludes any gated capability that has not been authorized by a human. A gated switch flipped by the loop alone renders the inline warning and is reported as blocked, never delivered. The "Get sign-off" fix here stands in for the real handoff: it records the human authorization that clears the block.

const tier = { read_run:'auto', proof_gate:'auto', gui_verify:'auto',
               enable_flag:'gated', merge_main:'gated' };

function effective(on, authorized) {
  // on by the loop AND (auto OR a human authorized it) = actually fires
  return Object.keys(on).filter(id =>
    on[id] && (tier[id] === 'auto' || authorized[id]));
}

This is the safety rail under "everything runs AFK": autonomy over the auto tier, a hard stop on the gated tier. The human never executes the auto work and is pulled in only for the gated forks.
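Here is the same map exercised end to end. `tier` and `effective` are copied from the excerpt above; `blocked` is a hypothetical companion added for this sketch to list what the panel would flag in red.

```javascript
const tier = { read_run: 'auto', proof_gate: 'auto', gui_verify: 'auto',
               enable_flag: 'gated', merge_main: 'gated' };

// Switched on by the loop AND (auto OR signed off by a human) = actually fires.
function effective(on, authorized) {
  return Object.keys(on).filter(id =>
    on[id] && (tier[id] === 'auto' || authorized[id]));
}

// Hypothetical companion: switched on, gated, and not yet signed off.
function blocked(on, authorized) {
  return Object.keys(on).filter(id =>
    on[id] && tier[id] === 'gated' && !authorized[id]);
}

const on = { read_run: true, proof_gate: true, enable_flag: true };

console.log(effective(on, {}));                     // [ 'read_run', 'proof_gate' ]
console.log(blocked(on, {}));                       // [ 'enable_flag' ]
console.log(effective(on, { enable_flag: true }));  // [ 'read_run', 'proof_gate', 'enable_flag' ]
```

Note that the loop flipping `enable_flag` on changes nothing by itself; only the `authorized` entry, which stands for the human sign-off, moves it from blocked to effective.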

Replay one AFK pass, state by state

Zoom all the way in: a single pass of the loop on the /health ticket. Press the events and watch the loop move through its states. The whole point is that you can't skip around — you can't VERIFY before you EXECUTE, and once a pass DECIDEs to advance, that unit is done. Buttons grey out the moment a move isn't allowed from where you are.

Think of it like… a board game where only certain squares connect. You roll, you move — but the board won't let you jump to a square there's no path to. The greyed-out buttons are the squares you simply can't reach from where you stand.

The clay-filled node is the current step. Faint nodes are unreachable from here.

Current step

LEARN

Observe the real state of rhg-api: no /health route exists yet, flag is off.

Allowed moves

Pass log (this is what LOOP-LOG.md records)

The pass is a finite-state machine

One current state, a fixed set of events, and a table mapping (state, event) → nextState. Events not in the table for the current state are disabled, so an illegal move (verifying before executing) is impossible by construction — the same reason the loop never claims done without running the gate. VERIFY is the only state with two exits: advance to DONE on a pass, or retry back to ANALYZE on a fail.

const loop = {
  LEARN:   { analyze: 'ANALYZE' },
  ANALYZE: { execute: 'EXECUTE' },
  EXECUTE: { verify:  'VERIFY' },
  VERIFY:  { advance: 'DONE', retry: 'ANALYZE' },  // gate decides which
  DONE:    {}                                       // terminal
};
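A table like this needs only a tiny interpreter. The sketch below repeats the table so it runs alone; `allowed` and `step` are hypothetical helper names, and throwing on an illegal move is an assumption (the widget greys the button out instead).

```javascript
const loop = {
  LEARN:   { analyze: 'ANALYZE' },
  ANALYZE: { execute: 'EXECUTE' },
  EXECUTE: { verify:  'VERIFY' },
  VERIFY:  { advance: 'DONE', retry: 'ANALYZE' },
  DONE:    {},
};

const has = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);

// Events the UI would enable for the current state.
function allowed(state) {
  return has(loop, state) ? Object.keys(loop[state]) : [];
}

// Returns the next state, or throws on a transition the table does not list.
function step(state, event) {
  if (!has(loop, state) || !has(loop[state], event)) {
    throw new Error(`illegal move: ${event} from ${state}`);
  }
  return loop[state][event];
}

console.log(allowed('VERIFY'));        // [ 'advance', 'retry' ]
console.log(step('VERIFY', 'retry'));  // ANALYZE
```

Calling `step('LEARN', 'verify')` throws, because there is no such row in the table: you cannot verify what has not been executed.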

When a proof failed: the recovery

Halfway through the run, a gate went red. The Validator ran curl … /health and got a 500, not the 200 the contract demands. This is the moment the whole design is built for: nothing shipped, no one was woken up, and the loop recovered itself. Below is the report the run produced — a timeline of what happened, the root cause dug out with five whys, the blast radius, and the fixes the loop applied — and you only read it.

Think of it like… a smoke alarm that goes off while the kitchen is still fine. The alarm is the win: it caught the problem before the fire, the sprinkler handled it, and the report tells you to move the toaster — not to rebuild the house.

GATE-FAIL

The pass where /health returned 500

Caught at: Pass 4 · VERIFY
Recovered: Pass 6 · VERIFY → 200
Shipped broken? No, the gate blocked it
Human woken? No, the loop self-recovered
Reported in: LOOP-LOG.md

9·a

Timeline of the failed pass

Read it top to bottom. Olive dots are routine, clay is a warning, red is the failed gate, green is recovery. The failure surfaced exactly where it should — at VERIFY, before anything shipped.

Pass 4 · execute · Executor builds the /health route. An agent adds the handler behind status_page_v1 and reports it complete.
Pass 4 · gate fail · VERIFY: gate returns 500, not 200. The Validator runs curl … /health. The handler throws on a nil config read. The claim "complete" is overruled by the boundary.
Pass 5 · retry · DECIDE: retry, not ship. The failure feeds back into ANALYZE. The loop does not advance and does not escalate; a failed gate is in scope for the loop to fix.
Pass 5 · diagnosed · Root cause found in the log. The handler read the flag config before it was loaded. ANALYZE narrows the fix to one bounded change.
Pass 6 · recovered · Fix executed and re-verified. Executor guards the config read; the Validator re-runs the gate. curl … /health → 200 on the real service.
Pass 6 · proven · Ticket moves itself to Proven. Only now, on a real 200 and not a claim, does the card advance. The run continues, untouched by a human.

The failure surfaced at VERIFY and was contained there. Two passes later the same gate returned a real 200.

9·b

Root cause — five whys

Keep asking "but why did that happen?" until you reach something you can actually fix. "The gate returned 500" is the symptom. The fifth answer is the one worth fixing — and it points at a missing test, not a person.

Why did the gate fail?
The /health handler returned a 500 instead of 200.
Why did the handler 500?
It threw reading a nil flag config.
Why was the config nil?
The handler read the flag before the config loader had run on cold start.
Why wasn't that caught earlier?
The ticket's done-when checked a warm process; nothing exercised the cold-start path.
Root cause · the gate had a blind spot
The verification command hit an already-initialized server, so the order-of-init bug was invisible to it. The fix is to add a cold-start case to the gate — the proof, not the code, was incomplete.

The gate did exactly its job

A claim of "complete" met a boundary that disagreed, and the boundary won. Nothing shipped on a lie because the loop never advances on a claim — only on a passing gate. The cost of the bug was two extra passes of compute, paid by the machine, with the human asleep.

Blameless, and it hardens the gate

The fix isn't "the agent wrote a bug"; it's "the contract's verification missed a path." Strengthening the gate (add the cold-start check) makes the system better, so the same class of failure can't slip past next time.

9·c

Blast radius & the fixes

The damage in numbers, and the good news at the end. Then the action items — check them off as they ship; the bar tracks progress.

Extra loop passes: 2
Broken builds shipped: 0
Humans paged: 0
Caught at the gate: 100%

Action items: 0 of 3 done

Add a cold-start case to the /health verification in GOAL.md · Owner: Validator agent · this run · P1
Guard the flag config read so a nil returns a safe default · Owner: Executor agent · this run · P1
Record the gate blind-spot lesson in review.md for the next run · Owner: QA agent · phase 4 · P2

The improve-the-prompt branch: tune the contract, watch it rebuild

The loop can improve two different things. Usually it improves the artifact — the code. But when the artifact keeps missing in the same way, the smarter move is to improve the prompt that drives the run: the instruction handed to the next agent. This tuner is that branch made tangible. Turn the knobs on the left — how strict the gate is, which boundary to verify, who the executor is, whether to demand a cited source — and the assembled instruction on the right rebuilds itself, word for word, so you see exactly how each lever rewrites the request before it is sent.

Think of it like… a coffee machine with dials for strength, size, and milk. You don't re-plumb the machine each time — you turn a dial and the next cup changes. Here the "cup" is the instruction the loop sends, and every dial rewrites it instantly.


One pure function maps knobs → instruction

Each control updates a shared state and calls render(); the preview is whatever buildPrompt(state) returns — nothing writes to it directly. Because the builder is pure (same state → same string), the instruction is reproducible and the changed line simply flashes. This is the literal mechanism of the loop's "improve" step when it targets the prompt instead of the code: change the contract, regenerate the instruction, run again, keep it only if the gate result improves.
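A pure builder of this kind might look like the sketch below. The knob names (`executor`, `boundary`, `strict`, `requireSource`) and the wording of every instruction line are hypothetical; only the shape matters: state in, string out, nothing else touched.

```javascript
// Hypothetical knobs and wording; only the purity of the builder is the point.
function buildPrompt(state) {
  const lines = [
    `You are the ${state.executor}. Build exactly one bounded unit.`,
    `Verify at this boundary: ${state.boundary}.`,
    state.strict
      ? 'The gate passes only on an exact match with the expected result.'
      : 'The gate passes on any 2xx response.',
  ];
  if (state.requireSource) lines.push('Cite a source for every non-obvious claim.');
  return lines.join('\n');
}

// Indexes of the lines that differ between two builds: what a preview would flash.
function changedLines(before, after) {
  const a = buildPrompt(before).split('\n');
  const b = buildPrompt(after).split('\n');
  const out = [];
  for (let i = 0; i < Math.max(a.length, b.length); i++) {
    if (a[i] !== b[i]) out.push(i);
  }
  return out;
}

const base = { executor: 'Executor agent', boundary: 'cold-start /health',
               strict: true, requireSource: false };

console.log(changedLines(base, { ...base, strict: false }));        // [ 2 ]
console.log(changedLines(base, { ...base, requireSource: true }));  // [ 3 ]
```

Since the same state always yields the same string, a run can record the state alone and regenerate the exact instruction later.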

Improve the artifact OR improve the prompt

The loop converges by improving whichever surface is the bottleneck. A flaky build → improve the code. A run that keeps verifying the wrong thing (the §9 blind spot) → improve the prompt and the gate. Both branches end the same way: re-run, re-prove, decide. The human still only observes.

The shipped result: a PR write-up

The run is done; here is what came out. A good result tells a story, not a diff dump. Before you read a line of code you should know the motivation (why we touched this), get a quick file tour (what moved and why), see the focus (the one subtle part worth a second look — the cold-start fix from §9), and trust the rollout (how it goes live, behind its flag, with a one-flip rollback). The four buttons are stops on that walk.

Think of it like… a tour guide, not a map dump. A map shows every street at once; a guide walks you through, points at the one statue that matters, and tells you where the exit is.

rhg/rhg-api · pull request #318

Add /health endpoint and a session-gated status page

Proven by the loop · +204 −12 · 5 files · flag: status_page_v1

Why RHG needed this at all.

The load balancer had no reliable way to tell whether an RHG instance was actually serving — it probed the home page, which could 200 while the app was wedged. We needed a cheap, unauthenticated /health the balancer can trust, plus a small session-gated /status page for operators to eyeball recent checks.

The pain

No trustworthy liveness signal; the balancer kept traffic on a wedged instance.

The goal

A public /health returning 200, and a /status page that 302s without a session.

Why now

The scope was sharp, the done-when was a command, and the whole thing fit one AFK evening.

Five files moved. Here's each one and why.

add · api/health.go · The new unauthenticated route. Returns 200 with a tiny JSON body once the config loader has run.
add · api/status_page.go · The session-gated page. Redirects (302) to login when there's no valid session cookie.
edit · config/flags.yaml · Registers status_page_v1, defaulted to false, so there is zero behavior change until it's flipped.
edit · api/router.go · Wires both routes behind the flag. The cold-start guard from §9 lives here.
test · api/health_cold_start_test.go · Exercises the cold-start path that the original gate missed. The test is the hardened spec.

The one tricky bit worth a careful read — the bug from §9.

Read this slowly: the handler must not read the flag config before the loader has populated it. On a cold start that read returned nil and the handler threw a 500 — the exact failure the gate caught. The fix guards the read with a safe default.

api/router.go — the cold-start guard

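// guard the flag read: a cold start must not 500
func healthEnabled() bool {
  cfg := flags.Current() // nil until the config loader has run
  if cfg == nil {
    return true   // /health is safe to serve even before flags load
  }
  // Assuming flags is its own package: the field must be exported
  // (HealthAlwaysOn) for package api to read it.
  return cfg.Enabled("status_page_v1") || cfg.HealthAlwaysOn
}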

If you review one hunk, make it this one — correctness lives here; everything else is wiring.

How it goes live without waking anyone.

Ship dark behind the flag
Merge with status_page_v1=false everywhere. /health serves (it's safe-on); /status stays dark. Safe to merge now.
Point the load balancer at /health
Switch the probe to /health in staging, watch one hour, then prod.
Canary the status page at 5%
Flip status_page_v1 for 5% of operator sessions. Watch status.render.errors stay at zero.
Ramp 5% → 100% — the gated step
Each step is clean for an hour before the next. Flipping to 100% is the user-gated decision from §7 — the loop hands it to a human at the handoff.
Rollback plan
Flip status_page_v1=false — instant revert, no deploy. Delete the flag in a follow-up PR after two weeks.
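The rollout rules above (hold until an hour is clean, roll back on errors, stop for a human before 100%) can be written as one decision function. This is a sketch under assumptions: `nextRampStep`, its parameter names, and the step list passed in are hypothetical, and any intermediate percentages are yours to choose.

```javascript
// Hypothetical ramp policy; function name, inputs, and step list are assumptions.
function nextRampStep(current, steps, { cleanMinutes, errors, humanSignedOff }) {
  if (errors > 0) {
    return { percent: 0, reason: 'rollback: flip the flag off, no deploy' };
  }
  if (cleanMinutes < 60) {
    return { percent: current, reason: 'hold: not clean for an hour yet' };
  }
  const next = steps.find(s => s > current);
  if (next === undefined) {
    return { percent: current, reason: 'done: already at the last step' };
  }
  if (next === 100 && !humanSignedOff) {
    return { percent: current, reason: 'handoff: 100% needs human sign-off' };
  }
  return { percent: next, reason: 'advance' };
}

const steps = [5, 100];
const clean = { cleanMinutes: 60, errors: 0, humanSignedOff: false };

console.log(nextRampStep(0, steps, clean).percent);                                // 5
console.log(nextRampStep(5, steps, clean).reason);                                 // handoff: 100% needs human sign-off
console.log(nextRampStep(5, steps, { ...clean, humanSignedOff: true }).percent);   // 100
```

The loop can take every branch here unattended except the last hop to 100%, which is the gated capability from §7.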


The reviewer's real questions, front-loaded

A diff answers "what changed?" but a reviewer asks "why?", "where do I look?", and "will this break prod?". The four-stage shape answers exactly those, so the right scrutiny lands on the load-bearing hunk (the cold-start guard) instead of spreading thin across renamed variables. The rollout earns trust because it's flag-gated, canaried on named metrics, with a no-deploy rollback — the reviewer approves a plan, not a leap.

How it was built: the suite map

Everything you just watched runs on one harness and five distributed skills. The loop-engineering harness is the spine: it runs the loop and orchestrates the AFK crew. Around it sit five skills, each owning one job: ultragoal writes the durable GOAL.md; visual-teach builds the course (this page); brightdata-cli is the one path to real web evidence; computer-use-cli drives native macOS apps non-blocking; Forge is the 7-step front-end that turns a raw ask into a run. The same five are installed across a dozen agents, so any agent can pick up the work.

One harness (clay, center), five skills around it. The Forge front-end and the toolbelt all plug into the same loop.

The harness

loop-engineering runs LEARN → ANALYZE → EXECUTE → VERIFY → DECIDE, drives the Forge 7 steps, and orchestrates the AFK crew (an Orchestrator delegating units via cli -p; an Executor that builds; a Validator that proves and is never the builder). It always ends a non-trivial job by producing a visual-teach course like this one.

The five skills

ultragoal — the durable-goal discipline behind GOAL.md; agent/CLI/model-agnostic. Universal activation is a durable goal run under the loop.
visual-teach — the course engine; emits the self-contained EN + PT-BR lessons.
brightdata-cli — the one uniform path to web data (SERP, scrape, browser, 40+ datasets). Always this CLI; never WebSearch/WebFetch; never the Bright Data MCP.
computer-use-cli — native macOS automation through the accessibility API only; non-blocking, never moves the cursor or raises the app.
Forge — the 7-step front-end that turns a raw ask into the run above.

Control-plane lineage: steipete/agent-scripts · ultragoal lineage: jxnl/dots.

The handoff: pick it up

That is the whole suite doing one job. Now it is yours. You do not need to remember the seven steps or the five states — you need to remember one move: invoke /loop-engineering. Give it your rough ask. For anything vague it runs Forge first; for anything concrete it goes straight to the loop. It runs AFK, it proves its own work at the real boundary, and it ends — exactly like this — by handing you a visual-teach course so the next person can pick it up too.

The only thing you keep doing is the thing you did on this whole page: observe. Read LOOP-LOG.md. Read review.md. Answer the one handoff fork when it comes. That is the job.

One move to start (/loop-engineering), AFK in the middle, a shipped result and a course out — and the loop is ready to run again.

Your teacher is one message away. This was the whole suite on one ask. Want to run it on your ask? Tell the agent the rough idea and say "drive it with /loop-engineering." Not sure if your ask is vague enough to need Forge, or concrete enough to go straight to the loop? Ask — that is exactly the kind of question to bring here.

In the code: where the run lives on disk

The run is not magic — it is a few plain files an agent reads and writes, plus the one command you type. Here are the real artifacts the run leaves behind, and exactly how to open them.

the artifacts of one run (under the repo it operates on)

# the durable contract every Validator checks against
GOAL.md            # goal · context · constraints · verification · done-when
# the Forge outputs
research.md        # facts pulled by the Bright Data CLI (cited)
PRD.md             # problem · scope · non-goals · done-when
issues/            # tickets with BLOCKING edges — the kanban
# the run records (what you read; you never execute)
LOOP-LOG.md        # every pass: LEARN→ANALYZE→EXECUTE→VERIFY→DECIDE
review.md          # the AFK QA observability report
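Once a run has finished, a quick way to confirm those files are really there is a loop over the names in the listing above. A minimal sketch: `check_artifacts` is a hypothetical helper, not part of the suite, and it checks only that each path exists, not what it contains.

```shell
#!/bin/sh
# Hypothetical helper: report which run artifacts exist under a repo directory.
# Usage: check_artifacts [DIR]   (DIR defaults to the current directory)
check_artifacts() {
  dir="${1:-.}"
  missing=0
  for name in GOAL.md research.md PRD.md issues LOOP-LOG.md review.md; do
    if [ -e "$dir/$name" ]; then
      echo "present  $name"
    else
      echo "missing  $name"
      missing=$((missing + 1))
    fi
  done
  # Exit status is zero only when every artifact was found.
  [ "$missing" -eq 0 ]
}
```

Because the function returns non-zero when anything is missing, it can sit in front of other commands, for example `check_artifacts . && cat review.md`.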

Start the run

One move, from any agent that has the skills installed:

# in the repo you want changed:
/loop-engineering  "RHG needs a /health endpoint and a status page, behind login, shipped safely"

Watch it (observe, never drive)

# tail the run log as passes land
tail -f LOOP-LOG.md
# read the QA report when phase 4 emits it
cat review.md
# the contract the whole run is held to
cat GOAL.md

Re-run the proof yourself (optional)

The done-when is a command, so you can confirm it by hand — the same check the Validator ran:

curl -s -o /dev/null -w "%{http_code}\n" localhost:8080/health   # → 200
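Reading the status code by eye works, but a gate is more useful when it fails loudly. The sketch below wraps the same curl call so the result becomes an exit code. The function names `probe_health` and `gate_verdict` and the 5-second timeout are assumptions made for this example; the URL is the one used above.

```shell
#!/bin/sh
# Hypothetical wrappers around the proof command shown above.

# Print only the HTTP status code for the health endpoint.
probe_health() {
  curl -s -o /dev/null -w "%{http_code}" --max-time 5 "${1:-localhost:8080/health}"
}

# Turn a status code into a verdict: exit 0 on 200, exit 1 on anything else.
gate_verdict() {
  if [ "$1" = "200" ]; then
    echo "PASS"
  else
    echo "FAIL (got ${1:-no response})"
    return 1
  fi
}
```

Used together as `gate_verdict "$(probe_health)"`, the check exits non-zero on any status other than 200, so it can gate a script the same way the Validator's check gates a pass.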

Quick check

Recall beats re-reading. Try each question from memory before you click; the answer is revealed on click, with the why. No tells in the formatting; every option is the same length.

Q1. During the AFK run, what is the human's only job?

B. Everything runs AFK; the human only has observability — reads LOOP-LOG.md / review.md / status, never executes, not even the QA. The single exception is a genuine user-only fork, surfaced as a handoff.

Q2. What makes a unit "done" in the loop?

C. The Proof Gate is a command run against the real artifact — never a claim, never a mock. In our run that was curl … /health → 200 on the actual service. A claim of "complete" lost to the boundary in §9.

Q3. Who proves that an Executor's work meets the goal?

A. The Validator is never the builder. Separating who builds from who proves is what stops an agent rubber-stamping its own work — the §7 "Med" risk made concrete.

Q4. When does Forge run before the loop?

D. Forge is the front-end for a raw or vague ask — its grill converges the scope into a measurable done-when. A concrete ask can go straight to the loop with a GOAL.md.

Q5. Why did the failed gate in §9 not become an outage?

B. Because the loop never advances on a claim, the 500 was caught at VERIFY before anything shipped. The failure fed back into ANALYZE and the loop fixed it itself — two extra passes, zero humans woken.

Q6. Which path is always used for web evidence?

C. Web evidence is always the Bright Data CLI — the one uniform path every agent has through the shell. Never WebSearch/WebFetch, and never the Bright Data MCP.


That is the course. You now have the model, the front-end, the toolbelt, and the one move that runs them all. Open any technical panel you skipped, then take it to your own work — your teacher is one message away.
