Agentic Design Patterns for Production: 7 Patterns We Battle-Tested at Chipp

The Gang of Four published Design Patterns in 1994. They didn’t invent any of the patterns in the book. They named twenty-three patterns that OOP developers were already using ad-hoc, and the act of naming them made the practice transferable. Engineers who’d never thought about Strategy or Decorator could read the book, recognize the moves they were already half-doing, and start using them on purpose.

Agentic systems are at the same place now. The shape of how to build production-grade autonomous workflows is becoming clear; the patterns are emerging across teams that have shipped real systems. They just don’t have names yet.

This post names seven of them. Each one is a pattern we’ve battle-tested across two years of running our autonomous development cluster at Chipp. Each one solves a specific failure mode. Each one is portable to whatever stack you’re building on.

You don’t need all seven to start. Pick three. Implement them this month. The other four will become obvious once the first three are working.

Pattern 1: The Multi-Stage Pipeline

Problem: A single Claude Code session that tries to do everything (research, implement, review, document, push) runs out of context budget. The session compacts. The agent loses the thread. The output quality collapses.

Pattern: Split the work into independent stages. Each stage gets its own Claude Code session with a fresh context window. Stages communicate by writing markdown files to disk; the next stage reads only the file the prior one wrote.

Our pipeline at Chipp:

[Trigger] → [Phase 0: Doc retrieval] → [Phase 1: Research]
         → [Phase 2: Implement]   → [Phase 3: Code review]
         → [Phase 4: Docs update] → [Phase 5: Push to prod]

Each phase is its own Claude session. Phase 1’s output is plan.md. Phase 2 reads plan.md (no other context) and writes the code. Phase 3 reads the diff (no other context) and reviews it. And so on.

Why this works: A single 1M-token context window can hold a lot, but it can’t hold everything you want it to hold across an entire feature ticket. By the time the agent has read 30 files, queried logs, formed a hypothesis, written code, run tests, and reviewed its own diff, the window is full and the early reasoning has been compacted to a useless paragraph.

Splitting into stages gives each stage a fresh window. The stage that’s writing code doesn’t need to remember every file the research stage looked at; it just needs the plan. The plan is the distilled output of the research, and distillation survives where raw evidence wouldn’t.

Implementation: Stage outputs are markdown files in a known location. The bash harness orchestrates the handoff. Don’t try to do this with one long-running session and “memory.” It will fail.
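To make the handoff contract concrete, here is a minimal sketch. It is written in Python purely for illustration and is not our harness. `run_stage` is a hypothetical stand-in for "launch a fresh session for this stage," and every file name except plan.md is an assumption.

```python
from pathlib import Path
from typing import Callable

# (stage name, output file). Only plan.md is named in the post;
# the other file names are hypothetical.
STAGES = [
    ("research", "plan.md"),
    ("implement", "diff.md"),
    ("review", "review.md"),
]


def run_pipeline(workdir: Path, ticket: str,
                 run_stage: Callable[[str, str], str]) -> Path:
    """Run stages in order; each sees only the previous stage's file."""
    handoff = ticket
    out_path = workdir
    for name, output_file in STAGES:
        out_path = workdir / output_file
        # run_stage stands in for one fresh session with a clean context.
        out_path.write_text(run_stage(name, handoff))
        # Re-read from disk: the file, not in-process state, is the interface.
        handoff = out_path.read_text()
    return out_path
```

The point of re-reading the file between stages is that nothing except what was written to disk can leak from one stage into the next.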

Anti-pattern: The temptation is to add more stages. Five is the right number for most pipelines. Don’t go to ten. Each additional stage costs latency and a chance for handoff failure. If a stage isn’t earning its place, merge it into a neighbor.

Pattern 2: Sub-Agent Dilution

Problem: Some investigations require huge amounts of context: reading thousands of lines of logs, running dozens of tool calls, correlating evidence across many sources. If you do this in your main session, you’ve burned the budget on context the main task doesn’t need.

Pattern: Spawn a sub-agent. The sub-agent has its own context window. It does the heavy investigation. It returns a one-paragraph insight to the calling agent. The 950k tokens of evidence stay in the sub-agent’s window, where they belong.

Our infra-ops sub-agent: When the main agent encounters something like “pods are restarting in production,” it doesn’t try to investigate on its own. It dispatches the infra-ops sub-agent. That sub-agent runs 47 kubectl commands, queries Loki, cross-references the deploy history, and returns: “OOM-killing because the last deploy lowered the memory limit too aggressively. Recommend bumping limits.memory from 512Mi to 1Gi.”

That two-sentence summary is what lands in the main session.

Why this works: The mental model is sending an intern to the library. You don’t want every page they read; you want the answer. Sub-agents give you the architectural shape to do exactly that.

Implementation: Define sub-agents in .claude/agents/. Each one is a markdown file with its own system prompt and tool list. Reference them in your root CLAUDE.md so the main agent knows when to dispatch which.

Anti-pattern: Don’t dispatch a sub-agent for a task the main agent could finish in two tool calls. Sub-agents have overhead: a separate model invocation, the prompt round-trip, the deserialization of the result. They pay off when the task would otherwise fill 50,000+ tokens of context. They cost more than they save when the task is small.
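The dispatch rule above is simple enough to write down. This is a sketch only: the two thresholds come from the paragraph above, but how you estimate tokens and tool calls before starting a task is left open, and the function name is hypothetical.

```python
DISPATCH_TOKEN_THRESHOLD = 50_000  # from the text: pays off at 50,000+ tokens
TRIVIAL_TOOL_CALLS = 2             # from the text: "two tool calls"


def should_dispatch(estimated_tokens: int, estimated_tool_calls: int) -> bool:
    """Decide whether a task is heavy enough to justify a sub-agent."""
    if estimated_tool_calls <= TRIVIAL_TOOL_CALLS:
        # The main agent can finish this itself; overhead would dominate.
        return False
    return estimated_tokens >= DISPATCH_TOKEN_THRESHOLD
```

In practice the estimates will be rough. The value of writing the rule down is that the main agent stops dispatching by reflex.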

Pattern 3: The Browser Verification Loop

Problem: The agent writes code that compiles, passes tests, and looks correct in the diff. None of that proves the code works in a browser. Buttons can render in the wrong color. Click handlers can throw runtime exceptions. API calls can fail. The agent doesn’t know.

Pattern: After every code change, the agent spins up a dev server, opens a browser via the browser MCP, navigates to the affected page, takes a screenshot, reads the console logs, and verifies the change worked. If anything’s wrong, the agent forms a new hypothesis and iterates.

The actual loop:

Code changes saved in worktree.
Dev server (already running on dedicated port) auto-reloads.
Agent calls browser_navigate('localhost:5184/affected-page').
Agent calls browser_screenshot(). Reads the image (multimodal models see the screenshot).
Agent calls browser_console_logs(). Reads the console output.
If no errors, agent calls browser_click('#confirm') or whatever interaction tests the change.
Repeat screenshot + logs read.
If errors, the agent forms a hypothesis, edits the code, loop restarts.
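Stripped of the tooling, the loop above has a simple shape. The sketch below assumes a hypothetical `Browser` interface whose methods mirror the tool names in the steps; it is not the API of any particular browser MCP. The `max_attempts` cap and the "any console line containing 'error'" check are also assumptions.

```python
from typing import Callable, List, Protocol


class Browser(Protocol):
    # Hypothetical interface mirroring the tool names in the steps above.
    def navigate(self, url: str) -> None: ...
    def screenshot(self) -> bytes: ...
    def console_logs(self) -> List[str]: ...


def verify_change(browser: Browser, url: str,
                  fix: Callable[[List[str]], None],
                  max_attempts: int = 3) -> bool:
    """Return True once the page loads with no console errors."""
    for _ in range(max_attempts):
        browser.navigate(url)
        browser.screenshot()  # a real agent would inspect the image here
        errors = [line for line in browser.console_logs()
                  if "error" in line.lower()]
        if not errors:
            return True
        # Stand-in for "form a hypothesis and edit the code".
        fix(errors)
    return False
```

A bounded attempt count matters here: an agent that cannot fix the page should stop and report rather than loop until a timeout kills it.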

Why this works: Most “AI ships bad code” stories are stories about agents that wrote plausible-looking code, never tested it, and pushed. The browser loop is the difference between “the agent thinks the code works” and “the agent has checked that the code works.” Closing this loop is the single architectural change that turned our cluster from interesting demo to production system.

Implementation: Custom browser MCP wrapping a headless Chromium via the Chrome DevTools Protocol. Off-the-shelf browser MCPs work for prototyping. For production, build your own and bake in your dev login flow, your seed data, and your test scenarios. The custom version is the difference between fast and slow autonomous verification.

Anti-pattern: Don’t run the verification loop on a shared dev server. Each agent worker needs its own port (we use 5180–5187 for our 8-worker pool) so parallel agents don’t fight for the same port. Each agent also needs its own git worktree so they don’t step on each other’s changes mid-loop.
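One way to keep workers from colliding is to derive both the port and the worktree from the worker index. The port range and pool size below come from the paragraph above; the worktree directory layout is a hypothetical choice.

```python
BASE_PORT = 5180  # from the text: ports 5180 to 5187
POOL_SIZE = 8     # from the text: 8-worker pool


def worker_port(worker_index: int) -> int:
    """Map a worker index (0..7) to its dedicated dev-server port."""
    if not 0 <= worker_index < POOL_SIZE:
        raise ValueError(f"worker_index must be in 0..{POOL_SIZE - 1}")
    return BASE_PORT + worker_index


def worker_worktree(worker_index: int, root: str = "worktrees") -> str:
    """Map a worker index to its own worktree path (hypothetical layout)."""
    worker_port(worker_index)  # reuse the range check
    return f"{root}/worker-{worker_index}"
```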

Pattern 4: CLAUDE.md as Scar Tissue

Problem: The agent makes the same mistake on every session. You correct it interactively. The next session, the correction is gone, because the agent doesn’t remember what it learned in a different session. You’re paying for the same lesson over and over.

Pattern: Treat your CLAUDE.md as a scar tissue document. When the agent makes the same class of mistake for the third time, stop the session, write a rule into CLAUDE.md that prevents it, and continue. Over months, your CLAUDE.md accumulates the real rules of your codebase, the ones you can only learn by getting bitten.

Why three strikes: Once is an outlier. Twice is suspicious. Three times is a pattern. Patterns are what CLAUDE.md is for. Adding a rule per mistake bloats the file with one-off lessons that dilute the load-bearing rules.
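If you want the three-strikes rule enforced rather than remembered, a counter per mistake class is enough. A minimal sketch follows; the class name and the free-text mistake labels are hypothetical.

```python
from collections import Counter

STRIKES_BEFORE_RULE = 3  # from the text: three times is a pattern


class MistakeLog:
    """Counts occurrences of each mistake class across sessions."""

    def __init__(self) -> None:
        self._counts = Counter()

    def record(self, mistake_class: str) -> bool:
        """Record one occurrence; return True exactly on the third strike."""
        self._counts[mistake_class] += 1
        return self._counts[mistake_class] == STRIKES_BEFORE_RULE
```

Returning True only on the third occurrence means the rule gets written once. A fourth occurrence signals that the rule itself needs tightening, not that another rule is needed.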

Why this works: CLAUDE.md loads in every session and survives compaction. It’s the only place to put context that you want the agent to have forever without paying for re-discovery on every run. Every line in CLAUDE.md pays compounding dividends.

Implementation: Have the agent write the rules into CLAUDE.md for you. The model knows what kind of rule will register on its own future inference better than you do. When the agent makes a mistake, prompt it: “Add a rule to CLAUDE.md that prevents this exact mistake. Cite the failure mode. Make it specific enough to act on.” Then read what it wrote and tighten it.

Anti-pattern: Aspirational CLAUDE.mds. Rules like “always write clean code” and “prefer composition over inheritance” are too vague to act on. The agent ignores them. Replace with specific, scar-tissue-grounded rules tied to real failure modes you’ve seen.

Pattern 5: The Auto-Load Table

Problem: Some context is too domain-specific to live in your root CLAUDE.md (it would bloat every session) but too cross-cutting to live in a single subdirectory CLAUDE.md (the rules apply across the codebase whenever a topic is mentioned, not whenever a directory is touched).

Pattern: At the top of your root CLAUDE.md, put a small markdown table mapping keywords to documentation files. When a prompt mentions any of the keywords, the agent reads the corresponding doc into context before starting work.

Our table at Chipp:

## Auto-load table

| Mention | Read |
|---|---|
| billing, stripe, payment, subscription | docs/billing.md |
| auth, login, session, oauth | docs/auth.md |
| websocket, realtime, streaming | docs/realtime.md |
| voice, livekit, transfer | docs/voice-agents.md |
| migration, schema, kysely | docs/db-migrations.md |

The keywords are inclusive: if a ticket mentions “stripe” or “subscription” or “billing,” the agent loads docs/billing.md. The rules in that doc are far too specific to put in the root CLAUDE.md (Stripe API quirks, our shadow-billing system, the eight failure modes of webhook delivery), but they’re load-bearing whenever the work touches billing.
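In our setup the agent reads the table and does the matching itself. If you would rather resolve the docs before the session starts, the same lookup is a few lines of code. This sketch copies the rows from the table above; the function name and the whole-word matching rule are assumptions.

```python
import re
from typing import List

# Rows copied from the auto-load table above.
AUTO_LOAD = {
    "docs/billing.md": ["billing", "stripe", "payment", "subscription"],
    "docs/auth.md": ["auth", "login", "session", "oauth"],
    "docs/realtime.md": ["websocket", "realtime", "streaming"],
    "docs/voice-agents.md": ["voice", "livekit", "transfer"],
    "docs/db-migrations.md": ["migration", "schema", "kysely"],
}


def docs_to_load(prompt: str) -> List[str]:
    """Return the docs whose keywords appear as whole words in the prompt."""
    words = set(re.findall(r"[a-z0-9]+", prompt.lower()))
    return [doc for doc, keywords in AUTO_LOAD.items()
            if words & set(keywords)]
```

Whole-word matching is deliberately strict: “author” does not trigger the auth doc, at the cost that “payments” does not trigger the billing doc unless you add it as a keyword.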

Why this works: It scales the system’s domain knowledge horizontally. You can add ten more rows to the auto-load table without inflating the per-session context cost, because the docs only load when relevant. Every successful autonomous run that produces a useful insight about a subsystem can write a new doc and get a new row, and tomorrow’s tickets get smarter.

Implementation: Put the table at the top of your root CLAUDE.md. Be conservative with keywords; false positives waste budget. Generate the docs lazily as you ship rather than trying to write them all upfront. Use the doc-update phase of your pipeline (Pattern 1) to keep the docs current.

Anti-pattern: Don’t load every doc on every session “just in case.” That’s the bloat that this pattern is designed to prevent. Trust the keyword match. If the agent fails to load a doc it should have loaded, the keyword list was wrong. Fix the list; don’t load everything.

Pattern 6: The Bash Harness Wrapper

Problem: Claude Code is non-deterministic. It can hang. It can run for hours. It can attempt commands you don’t want it running. It can finish work and forget to push. Your business needs to be deterministic. Your tokens are limited. Something has to be the adult in the room.

Pattern: Wrap every Claude Code invocation in a bash script that supervises the session. The script enforces timeouts, kills hangs, bans dangerous commands, forces the final commit and push, cleans up worktrees, and writes outcome labels.

What our harness enforces:

Idle kill: if no tool call fires for 5 minutes, kill the session. Catches hangs.
Wall-clock timeout: timeout 7200 (2 hours) caps runaways.
Banned-flag grep: git push --no-verify, git reset --hard, rm -rf are aborted on detection.
Forced commit + push: at the end of every session, check the worktree state and force the push if Claude forgot.
Worktree cleanup: each run isolated; nothing leaks between workers.
Outcome logging: every run writes a JSONL row to a fine-tuning archive. (Pattern 7.)
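The decision logic behind the first three checks is small. The sketch below is written in Python only to keep this post’s examples in one language; the harness itself is bash. The limits and the banned commands come from the list above, while the regular expressions and function names are assumptions.

```python
import re
from typing import Optional

IDLE_LIMIT_SECONDS = 5 * 60      # from the list: 5 minutes without a tool call
WALL_CLOCK_LIMIT_SECONDS = 7200  # from the list: timeout 7200 (2 hours)

# Hypothetical patterns for the three banned commands named above.
BANNED_PATTERNS = [
    re.compile(r"\bgit\s+push\b.*--no-verify"),
    re.compile(r"\bgit\s+reset\b.*--hard"),
    re.compile(r"\brm\s+-rf\b"),
]


def is_banned(command: str) -> bool:
    """True if the command line matches any banned pattern."""
    return any(p.search(command) for p in BANNED_PATTERNS)


def kill_reason(now: float, started_at: float,
                last_tool_call_at: float) -> Optional[str]:
    """Return why the session should be killed, or None to let it run."""
    if now - started_at > WALL_CLOCK_LIMIT_SECONDS:
        return "wall_clock"
    if now - last_tool_call_at > IDLE_LIMIT_SECONDS:
        return "idle"
    return None
```

Note that a literal `rm -rf` pattern misses variants such as `rm -fr`. A real ban list needs more care than this sketch shows.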

Why this works: The model is good at writing code. It’s bad at managing its own time, recognizing when it’s stuck, or cleaning up after itself. The harness handles those things deterministically so the model can focus on the work.

“An autonomous agent without a bash harness is an intern with no manager, no deadline, and an unlimited API budget.” — Hunter Hodnett, Chipp CTPO

Implementation: Bash, not Node or Python. Bash is the right language for wrapping Unix processes: subprocess management, signals, timeouts, and pipes are all concise in bash and verbose in everything else. Bash is also debuggable in production (no compilation step), and Claude has more bash training data than any other shell language, so the agent can edit the harness too.

The skeleton is in the bug bot post. About 200 lines covers everything above.

Anti-pattern: Don’t try to build the harness in your application stack. Don’t make it a feature of your CI system. The harness is a deliberately tiny, deliberately separate piece of infrastructure. Keep it that way.

Pattern 7: The Outcome-Labeled JSONL Archive

Problem: Every autonomous session generates a record of how a frontier model approached a real task in your codebase. That’s training data. If you don’t capture it, it’s gone forever. If you do capture it, you have the basis for fine-tuning a cheaper, specialized model on your own work, the kind of moat that compounds.

Pattern: After every autonomous run, append a JSONL row to a long-term archive that describes the run: the stages, the token spend, the tool calls, the diff, and the outcome label.

Our archive row:

{
  "ticket_id": "billing-create-customer-null-pmt",
  "trigger_source": "grafana",
  "started": "2026-04-15T03:31:18Z",
  "finished": "2026-04-15T03:47:02Z",
  "stages": {
    "research": { "tokens": 412053, "tool_calls": 38 },
    "implement": { "tokens": 187234, "tool_calls": 23 },
    "review": { "tokens": 91482, "tool_calls": 12, "edits": 1 },
    "docs": { "tokens": 43210, "tool_calls": 4 },
    "push": { "tokens": 0, "tool_calls": 0 }
  },
  "outcome": "clean",
  "regressions_detected_24h": false
}

The outcome field is the label. clean means: review made ≤5 edits, all tests passed first try, no regressions detected within 24 hours of deploy. messy means anything else.

Why this works: Labeled data is the asset that produces fine-tuned models. Every successful autonomous run produces a labeled training row showing how a frontier model approached a real engineering task. After a quarter, you have thousands of rows. After a year, you have a dataset no other team can replicate, because it’s specific to your codebase and your practice.

You may never train a model on this data. That’s fine. The decision to capture it is one you make today. The decision to use it is one you can defer for years. But you can’t decide to use data you didn’t capture.

Implementation: One JSONL row per session. Append-only file in cheap storage (S3, R2, even disk for now). Label the outcome with whatever automated heuristics you have: review edit count, test pass rate, post-deploy regression detection. Don’t try to label perfectly; the labels can be improved later.
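The labeling rule and the append are both a few lines. This sketch follows the definition of clean given earlier; the function names and the `tests_passed_first_try` input are assumptions, since that signal is not a field in the example row. The 24-hour regression check can only be known after the fact, so a label written at the end of a run is provisional.

```python
import json
from pathlib import Path

MAX_REVIEW_EDITS = 5  # from the text: review made at most 5 edits


def label_outcome(review_edits: int, tests_passed_first_try: bool,
                  regressions_detected_24h: bool) -> str:
    """Apply the post's definition: 'clean' only if all three checks hold."""
    clean = (review_edits <= MAX_REVIEW_EDITS
             and tests_passed_first_try
             and not regressions_detected_24h)
    return "clean" if clean else "messy"


def append_row(archive: Path, row: dict) -> None:
    """Append one run as a single JSON line; never rewrite earlier rows."""
    with archive.open("a", encoding="utf-8") as f:
        f.write(json.dumps(row, sort_keys=True) + "\n")
```

For example, `label_outcome(1, True, False)` gives `"clean"`, matching the example row, where review made one edit and no regression was detected.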

Anti-pattern: Don’t try to make this data structured beyond JSONL. JSONL is append-only, easy to grep, easy to load into training pipelines. SQL is overkill. NoSQL is a different kind of overkill. Just the file.

How to use these patterns

You don’t need to implement all seven on day one. Most teams who try fail at exactly that: they read this post, they get excited, they try to build a 7-pattern cluster in two weeks, they fail at three of them, and they conclude none of it works.

The order I’d implement them in:

Browser verification loop (Pattern 3). Without this, you can’t have autonomy. Build it first. Even if you do nothing else from this list, build this.
Multi-stage pipeline (Pattern 1). The next biggest leverage. Splits your sessions, controls your context budgets, makes everything else possible.
Bash harness (Pattern 6). Once you have a pipeline, you need the supervisor. This is the difference between a hobby project and something you can leave running overnight.
CLAUDE.md as scar tissue (Pattern 4). Discipline, not infrastructure. Start practicing it on day one of using any of the others. The compounding starts immediately.
Auto-load table (Pattern 5). After you’ve shipped enough autonomous tickets to start writing real /docs/ files. Premature otherwise.
Sub-agent dilution (Pattern 2). Once your main pipeline is hitting context-budget walls on the heavy investigations. Solves a real problem; not a problem you have on day one.
Outcome-labeled archive (Pattern 7). Nothing-to-lose pattern. Start it as soon as you have a pipeline. Even if you never use the data, you’ll be glad you have it in a year when distillation becomes the obvious move.

Pick three. Build them this month. The other four become obvious once the first three are working.

What you actually get

A team running these seven patterns ships in a different category than a team running interactive Claude Code sessions, and a radically different category than a team still doing all-human engineering.

The numbers from our cluster, honestly:

20–30 production deploys per day on a two-person engineering team.
70–80% first-attempt success rate on autonomous tickets.
Mean time from production error to fix in production: ~30 minutes, autonomously.
Token cost per ticket: low double-digit dollars on a frontier model.
Pull requests: zero.
Pages we receive overnight: zero.

The patterns aren’t magic. Each one solves a specific failure mode. Together they compose into a system where the failure modes don’t compound: when one pattern hits its edge case, the others catch it.

The Gang of Four made OOP transferable by naming the moves. The seven patterns above make autonomous development transferable. Use them, name them, build on them. We’ll be writing about the next batch as they emerge.

If you want the high-level case for autonomous development, read The Autonomous Development Manifesto.

If you want the implementation walkthrough of all seven patterns wired into one cluster, read Building a Self-Healing Bug Bot.

If you want the discipline that underpins every pattern in this post, read Context Engineering.