You opened Claude Code at 9am with a clear task. Two hours later the code exists, you missed your standup, and you were present for every single agent turn — steering, correcting, re-specifying.

That's not a productivity win. You were the bottleneck on every one of those turns — and the moment you stepped away, everything stopped.

The fix is an outer loop that runs without you. Here's what that means and how to build it.

You Are the Outer Loop

When you work turn-by-turn with a coding agent, you're doing four jobs: scheduler, context manager, verifier, and state machine. The moment you close the laptop, everything stops. You can't parallelise beyond however many windows you're watching.

The ceiling isn't the agent's capability — it's your attention.

The DORA 2025 data shows 90% of developers using AI and 80% reporting productivity gains. The same data shows higher AI adoption correlating with higher delivery instability. Faros's 2026 telemetry across 22,000 developers: PR review time up 441%, incidents per PR up 242%. You went faster into more problems. The bottleneck moved — from writing code to specifying and verifying it.

You didn't remove yourself from the loop. You just made the loop faster.

Two Loops. One You Build.

Every coding agent already runs an inner loop: read context → call a tool → observe the result → repeat. Claude Code implements this as a recursive async generator. You don't build it. You don't touch it.

What you build is the outer loop: decide what task to run, give the agent context, check the result against the spec, persist state between sessions, trigger the next task. Right now you do every one of those steps by hand.

A loop without a real check is just the agent agreeing with itself on repeat, faster.

The check matters more than the loop. CMU researchers (Xu, Martelaro, McComb, arXiv:2603.24768, May 2026) tested three architectures on an engineering task — plain loop, self-verifying loop, and a loop with a separate evaluator agent. The separate evaluator won decisively. The generator cannot reliably grade its own work. This is structural, not a prompting problem.

Separate the maker from the checker — always.

Anatomy of a Loop

Addy Osmani's June 2026 breakdown named five components every production loop needs. Both Claude Code and Codex now ship all five natively. Here's what each one does and why skipping it breaks you:

1. Trigger — what starts work without you

A git push, CI failure, cron schedule, Slack message, PR opened. Without a trigger, you're still the one who decides when the loop runs. That makes it a one-shot script, not a loop.

Claude Code: /schedule, hooks on git events, GitHub App integration. Codex: Automations tab, event-driven triggers.

2. Worktrees — isolation for parallel agents

Two agents writing to the same directory produce failures that are hard to reproduce and painful to debug — not just merge conflicts, but mid-run overwrites. Git worktrees give each agent a clean, isolated checkout. When the task finishes, the worktree tears down.

Claude Code: isolation: worktree in sub-agent spawn config. The runner handles create/teardown. Why it matters: This is what makes running 3–4 agents in parallel safe rather than chaotic.
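If you want to see what that isolation amounts to, here is a minimal sketch using plain git worktree commands, assuming you manage worktrees yourself rather than through Claude Code's sub-agent config. The function name run_in_worktree, the agent/<task> branch naming, and the temp-directory location are hypothetical choices for illustration.

```shell
#!/usr/bin/env bash
# Sketch: run one command inside its own git worktree, then remove the worktree.
# run_in_worktree and the agent/<task> branch name are hypothetical, not a
# Claude Code API. Call it from inside a git repo with at least one commit.

# usage: run_in_worktree <task-name> <command> [args...]
run_in_worktree() {
  local task="$1"; shift
  local base wt status
  base=$(mktemp -d) || return 1
  wt="$base/$task"

  # New branch agent/<task> checked out in a fresh directory, starting at HEAD.
  # git's messages are sent to stderr so stdout carries only the command's output.
  if ! git worktree add -b "agent/$task" "$wt" HEAD >&2; then
    rm -rf "$base"
    return 1
  fi

  # The command sees an isolated checkout; it should commit what it wants kept.
  ( cd "$wt" && "$@" )
  status=$?

  # Teardown discards uncommitted files; commits stay on branch agent/<task>.
  git worktree remove --force "$wt" >&2
  rm -rf "$base"
  return "$status"
}
```

Calling run_in_worktree fix-auth ./agent-task.sh (where agent-task.sh is a placeholder for whatever launches your agent) gives that run its own directory and branch, so a second call with a different task name cannot overwrite its files.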

3. Skills — memory that survives session resets

Agent sessions are stateless. Every fresh session starts blind — you re-explain the repo, the conventions, the constraints. Skills are the fix: structured context files the agent reads at the start of every session.

  • CLAUDE.md — Claude Code's native instruction file. Build commands, test commands, hard rules, known pitfalls. Every time the agent makes a repeatable mistake, add one line. The file accumulates team knowledge the model would otherwise relearn from scratch.

  • AGENTS.md — The cross-tool open standard (Linux Foundation, Dec 2025). Claude Code doesn't read it natively yet — ln -s AGENTS.md CLAUDE.md works as a bridge.

Keep both files short. Instructions are technical debt. If the model now knows something natively, delete the instruction.
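The symlink bridge can be wrapped in a small guard so it never overwrites a hand-written CLAUDE.md. This is a sketch under that assumption; link_agents_md is a hypothetical helper name, not part of any tool.

```shell
#!/usr/bin/env bash
# Sketch: keep AGENTS.md as the single source and expose it as CLAUDE.md.
# link_agents_md is a hypothetical helper name.

# usage: link_agents_md [directory]   (defaults to the current directory)
link_agents_md() {
  local dir="${1:-.}"
  if [ ! -f "$dir/AGENTS.md" ]; then
    echo "no AGENTS.md in $dir" >&2
    return 1
  fi
  # Refuse to clobber a real CLAUDE.md; merge it into AGENTS.md by hand first.
  if [ -e "$dir/CLAUDE.md" ] && [ ! -L "$dir/CLAUDE.md" ]; then
    echo "CLAUDE.md exists and is not a symlink; leaving it alone" >&2
    return 1
  fi
  # Relative target, so the link still resolves after the repo is moved or cloned.
  ( cd "$dir" && ln -sf AGENTS.md CLAUDE.md )
}
```

Run it once at the repo root and commit the resulting link.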

4. Connectors — tools beyond the filesystem

A loop that can only read and write files is limited to what's in the repo. MCP servers extend the agent's reach into the tools your workflow actually uses: GitHub, Linear, Jira, Sourcegraph, internal APIs, monitoring dashboards.

Stripe's internal MCP library has ~500 tools. Each Minion task gets ~15 curated ones. More available isn't better: it degrades decision quality the same way 500 open browser tabs degrade yours.

5. Evaluator — the separate checker

The most important component. Never let the generator verify its own output.

Claude Code /goal (shipped May 2026) implements this natively: define a verifiable completion condition, and a separate evaluator model checks after each turn. Loop continues until the condition passes or the turn cap hits. The writing model and the checking model never share context.

This is also what the CMU research (arXiv:2603.24768) showed empirically: a separate co-regulation agent outperforms self-verification with a large effect size. Build this in from day one.

In Production

Stripe merges 1,300+ agent-written PRs per week. An engineer tags a Slack bot. A deterministic orchestrator prefetches context from Sourcegraph. The agent runs in a disposable cloud devbox — no prod access, no real data. Work moves through a Blueprint: LLM step → deterministic linter gate → LLM step → deterministic commit gate. CI runs in three tiers with a hard two-attempt cap; fail it and it escalates to a human.

The key insight isn't the AI. It's that Stripe's existing engineering infrastructure — devboxes, linters, CI — is exactly what makes unattended agents safe. They didn't build AI infrastructure. They wired the agent into the infrastructure they already had.
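The gate-plus-cap idea is small enough to sketch in shell. What follows is an illustration of the pattern only, not Stripe's Blueprint code: run_gate, its argument order, and the exit code 2 for "escalate" are assumptions made for this example.

```shell
#!/usr/bin/env bash
# Sketch of a deterministic gate with a hard attempt cap.
# run_gate, its arguments, and the exit-code convention are illustrative
# assumptions; this is not Stripe's implementation.
#
# usage: run_gate <max_attempts> <fix_cmd> <check_cmd>
#   check_cmd: the deterministic gate (linter, test suite, CI job)
#   fix_cmd:   the step that tries to repair a failure (an agent call in practice)
# Returns 0 when the gate passes, 2 when the cap is hit and a human should take over.

run_gate() {
  local max_attempts="$1" fix_cmd="$2" check_cmd="$3"
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if sh -c "$check_cmd"; then
      echo "gate passed on attempt $attempt"
      return 0
    fi
    echo "gate failed on attempt $attempt" >&2
    # Only spend a fix step if another attempt remains.
    if [ "$attempt" -lt "$max_attempts" ]; then
      sh -c "$fix_cmd"
    fi
    attempt=$((attempt + 1))
  done
  echo "cap of $max_attempts attempts reached: escalate to a human" >&2
  return 2
}
```

With a cap of 2, the agent gets exactly one chance to repair a failing gate before the work is handed back to a person.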

Gas Town is Steve Yegge's open-source multi-agent orchestrator: 20–30 parallel Claude Code instances, Mayor/Polecat/Witness roles, git-backed state via Beads. Burn rate ~$100/hr at capacity. This is Stage 7–8 on Yegge's maturity model. It shows what a fully engineered outer loop looks like. It's not where you start.

Loops and Skills in the Wild

Real resources, not vendor docs:

awesome-ralph — The community-curated starting point. Videos, playbooks, Geoffrey Huntley's canonical deep-dive, variant implementations. Read this before writing any bash.

cobusgreyling/loop-engineering — CLI tools built around the primitives: loop-init scaffolds CLAUDE.md + STATE.md + PROMPT.md, loop-audit inspects run history, loop-cost estimates token spend before you run. The primitives matrix here is the clearest single-page map of what a loop needs.

gastownhall/gastown — Yegge's orchestrator, open source. Worth reading the architecture docs even if you never run it. The Mayor/Polecat/Witness role separation and Beads state design translate to simpler systems.

VILA-Lab/Dive-into-Claude-Code — Systematic teardown of Claude Code's codebase. Key finding: 98.4% is deterministic infrastructure. The AI logic is 1.6%. Engineering discipline around the model, not the model itself, is what produces reliable output.

ghuntley.com/ralph — Huntley's original Ralph Loop writeup. Read the section on context rot carefully. His public warning about the official Anthropic Ralph plugin — that it re-feeds prompts into a growing session rather than resetting context per iteration — remains valid and matters if you're considering the plugin as a shortcut.

What Breaks

No completion condition. "Improve the error handling" has no finish line. The agent stops when it subjectively decides it's done. If you can't write the condition before starting, the task isn't loop-ready.
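One way to test whether a task is loop-ready is to write its finish line as commands that must all exit 0. A minimal sketch, assuming your project already has test and lint commands; is_done is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: a completion condition expressed as commands that must all exit 0.
# is_done is a hypothetical helper; pass it your project's own commands.

# usage: is_done "<command>" ["<command>" ...]
is_done() {
  local cmd
  for cmd in "$@"; do
    if ! sh -c "$cmd" >/dev/null 2>&1; then
      echo "NOT DONE: '$cmd' failed"
      return 1
    fi
  done
  echo "DONE: all $# checks passed"
  return 0
}

# Example, using the placeholder commands from a typical Node project:
#   is_done "npm test" "npm run lint"
```

"Improve the error handling" cannot be written this way, which is the signal that it needs more specification before it goes into a loop.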

Self-verification. The CMU paper again. Agent marks done, you trust it, reviewer finds three bugs. Use /goal or a separate checker. Non-negotiable.

Missing cost guards. Single-agent loops consume ~4x the tokens of standard chat. Multi-agent ~15x. One unbounded overnight run is a surprise you don't want. Set an iteration cap, a no-progress detector, and a token budget before the first unattended run.
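The three guards fit in one function that the loop calls before each iteration. This is a sketch with made-up limits; guard_check and the variable names are hypothetical, and the token count is passed in because how you measure it depends on what your agent CLI reports.

```shell
#!/usr/bin/env bash
# Sketch of three guards for an unattended loop.
# Names and limits are illustrative assumptions, not recommended values.

MAX_ITER=10          # hard cap on iterations
MAX_STALLED=2        # consecutive no-change iterations tolerated
TOKEN_BUDGET=500000  # total tokens allowed for the whole run

# usage: guard_check <iterations_done> <stalled_count> <tokens_used>
# Returns 0 if another iteration may start; otherwise prints why and returns 1.
guard_check() {
  local done_iters="$1" stalled="$2" tokens="$3"
  if [ "$done_iters" -ge "$MAX_ITER" ]; then
    echo "stop: iteration cap reached ($done_iters/$MAX_ITER)"
    return 1
  fi
  if [ "$stalled" -ge "$MAX_STALLED" ]; then
    echo "stop: no progress for $stalled iterations"
    return 1
  fi
  if [ "$tokens" -ge "$TOKEN_BUDGET" ]; then
    echo "stop: token budget spent ($tokens/$TOKEN_BUDGET)"
    return 1
  fi
  return 0
}
```

Inside a loop this reads as guard_check "$ITER" "$STALLED" "$TOKENS" || break, with the loop responsible for keeping those three counters up to date.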

No worktree isolation for parallel agents. Two agents in the same directory produce state that's hard to reason about. Treat worktrees as required infrastructure.

When not to loop. Exploration, architecture decisions, anything where "done" is a judgment call — keep those attended. Loops are force-multipliers on well-specified work. On under-specified work they amplify the ambiguity.

What's Still Rough

Loop observability. When a loop runs overnight and produces thirty commits, auditing the decisions it made and why is harder than it should be. The run log exists. A coherent decision trail alongside the code does not. That's where the next tooling layer will land.

The longer signal: developers in AI-assisted conditions score lower on code comprehension tests (VILA-Lab, citing longitudinal research). Loops that run faster than you can review compound that risk. Use them to expand what you can review and decide — not to replace it.

The job didn't change. You decide what gets built and whether it's good. What changed is that everything between those two decisions can now run without you.

Try This Now

Pick a failing test. Nothing else. Steps 1 to 3 get a first loop running on it; step 4 is a different pattern for scheduled backlog work.

1. Bootstrap CLAUDE.md — paste this, fill the brackets:

# CLAUDE.md

## Project
[One sentence: what this repo does and stack]

## Commands
- Build: [e.g. `npm run build`]
- Test: [e.g. `npm test` or `pytest tests/`]
- Lint: [e.g. `npm run lint` or `ruff check .`]
- Type check: [e.g. `npx tsc --noEmit`]

## Hard rules
- Never modify [migrations / schema / generated files] directly
- All new functions need a docstring before committing
- No `any` in TypeScript — use `unknown` + type guard

## Conventions
- [e.g. Services in src/services/, named *.service.ts]
- [e.g. Tests live next to the file they test]

## Known pitfalls
- [e.g. Tests need a seeded DB: `npm run seed:test` first]

2. Run your first /goal loop:

claude --goal "tests/auth.test.ts passes and eslint reports zero errors" \
       --max-turns 15 \
       "Fix the failing test in tests/auth.test.ts.
        Read the test first to understand expected behaviour.
        Do not modify the test file itself."

Step away. Read the turn log when it finishes. Every decision that surprised you is a new line in CLAUDE.md.

3. For longer tasks — paste this as run-loop.sh:

#!/bin/bash
# Minimal Ralph Loop: fresh context per iteration, state on disk
# Usage: ./run-loop.sh PROMPT.md 10
# Run inside a git repo that already has at least one commit.

PROMPT_FILE="${1:-PROMPT.md}"
MAX_ITER="${2:-10}"
ITER=0

[ ! -f "$PROMPT_FILE" ] && echo "Error: $PROMPT_FILE not found" >&2 && exit 1

while [ "$ITER" -lt "$MAX_ITER" ]; do
  ITER=$((ITER + 1))
  echo "=== Iteration $ITER / $MAX_ITER ==="

  BEFORE=$(git rev-parse HEAD)                             # where this iteration started

  claude -p < "$PROMPT_FILE"                               # -p: non-interactive run, fresh context every time

  # Commit whatever this iteration changed; "nothing to commit" is not an error
  git add -A && git commit -q -m "loop: iter $ITER" >/dev/null 2>&1 || true

  # No new commit (ours or the agent's) means the iteration changed nothing
  if [ "$(git rev-parse HEAD)" = "$BEFORE" ]; then
    echo "No changes. Converged or stuck. Exiting."
    break
  fi
done

echo "Done after $ITER iteration(s). Review: git log --oneline -$ITER"

Pair it with a PROMPT.md that specifies the task, the acceptance criteria, and tells the agent to read LOOP-STATE.md for prior iteration context.
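A starting point for those two files, written as a scaffold function so it can sit next to run-loop.sh. The section names and wording inside the templates are assumptions, not a required format, and scaffold_loop_files is a hypothetical name.

```shell
#!/usr/bin/env bash
# Sketch: create PROMPT.md and LOOP-STATE.md if they do not exist yet.
# Template wording is an assumption; edit it to fit the task.

# usage: scaffold_loop_files [directory]   (defaults to the current directory)
scaffold_loop_files() {
  local dir="${1:-.}"
  if [ ! -f "$dir/PROMPT.md" ]; then
    cat > "$dir/PROMPT.md" <<'EOF'
# Task
[What to build or fix, in one or two sentences]

# Acceptance criteria
- [e.g. `npm test` exits 0]
- [e.g. `npm run lint` reports zero errors]

# Every iteration
1. Read LOOP-STATE.md for notes from prior iterations.
2. Do the next smallest step toward the acceptance criteria.
3. Append to LOOP-STATE.md: what you did, what failed, what to try next.
EOF
  fi
  if [ ! -f "$dir/LOOP-STATE.md" ]; then
    printf '# Loop state\n\n(no iterations yet)\n' > "$dir/LOOP-STATE.md"
  fi
}
```

Because each iteration starts with a fresh context, LOOP-STATE.md is the only memory the agent has of earlier attempts, so the instruction to append to it matters as much as the instruction to read it.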

4. The prompt-as-loop pattern — cron triggers it, the prompt drives it:

This is a different approach entirely, described by Owain Lewis in his AI Engineer newsletter. Instead of writing a bash loop, you write the loop logic in the prompt itself — the agent reads the state of your backlog, picks a task, does the work, updates state, and exits. Cron wakes it up on schedule. No iteration logic in shell.

The control plane is GitHub Issues (or Linear). Tickets move through statuses: backlog → agent-ready → in-progress → in-review → done. The agent reads the board, picks one ticket, does the work, posts evidence on the issue, opens a PR, and exits. Cron fires it again next hour.

Create two files:

AGENT_LOOP.md — the agent's complete job description for one run:

# Agent Loop — Worker Run

You are a software engineer working autonomously. Each run, you complete
one unit of work from the backlog. No more.

## Your job on each run

1. Check the current branch is clean. If it isn't, stop and comment on the
   most recent in-progress issue explaining why you stopped.

2. Query GitHub Issues for tickets labelled `agent-ready` and `low-risk`.
   Pick the oldest one. If there are none, exit cleanly: nothing to do.
   Otherwise move the issue to `in-progress` before you start, so the next
   run does not pick it up.

3. Create a branch: `agent/<issue-number>-<slug>`.

4. Read the issue body carefully. Implement what it describes.
   - Write the code
   - Write or update tests to cover the change
   - Run the test suite — fix any failures before continuing

5. Spawn a sub-agent to review your diff against the issue requirements.
   Apply any findings. Do not open a PR if the reviewer flags unresolved issues.

6. Open a pull request. PR body must include:
   - Which issue it closes
   - What you changed and why
   - Test output showing the suite passes

7. Comment on the issue with the PR link and a short summary of what you did.
   Move the issue status to `in-review`.

## Hard guardrails
- One issue per run. Stop after completing it.
- Do not touch files outside the scope of the issue.
- Do not merge the PR — that decision belongs to a human.
- If anything in the issue is ambiguous, comment asking for clarification
  and exit. Do not guess.

## Context
Read CLAUDE.md for project conventions, commands, and known pitfalls.
Read LOOP-STATE.md if it exists for notes from previous runs.

Schedule it with cron — one ticket per hour, logs to file:

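# crontab -e
# -p runs Claude Code non-interactively; mkdir -p makes sure the log directory exists
0 * * * * cd /path/to/repo && mkdir -p .loop-logs && claude -p < AGENT_LOOP.md >> .loop-logs/worker-$(date +\%Y\%m\%d).log 2>&1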

The key difference from the bash loop: the agent handles the "what to do next" decision, not the shell. The prompt is the loop. Cron is just the heartbeat. You read the log the next morning and see exactly what the agent picked, what it questioned, and what it left for you.
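One thing an hourly heartbeat does not give you is protection against a run that takes longer than an hour. A minimal sketch of a lock, assuming a lock directory is acceptable (mkdir is atomic, so only one run can create it); the function names, the lock path, and exit code 75 are illustrative choices.

```shell
#!/usr/bin/env bash
# Sketch: keep scheduled runs from overlapping with a lock directory.
# Function names, lock path, and the 75 exit code are assumptions.

acquire_lock() {
  mkdir "$1" 2>/dev/null   # succeeds for exactly one caller
}

release_lock() {
  rmdir "$1" 2>/dev/null
}

# usage: run_exclusive <lock_dir> <command> [args...]
# Returns 75 without running the command when another run holds the lock,
# otherwise returns the command's own exit status.
run_exclusive() {
  local lock_dir="$1"; shift
  local status
  if ! acquire_lock "$lock_dir"; then
    echo "previous run still active; skipping" >&2
    return 75
  fi
  "$@"
  status=$?
  release_lock "$lock_dir"
  return "$status"
}
```

Put the agent invocation from the crontab line in a small wrapper script and call that script through run_exclusive. If a run is killed partway through, the lock directory is left behind and has to be removed by hand before the next run will start.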

Start with one ticket per run. Raise the cap only after you have reviewed ten runs and the quality is consistent.

What does your completion condition look like? Reply — the patterns coming from real tasks are more useful than anything in vendor docs.
