Building a Self-Healing Bug Bot: The Autonomous Dev System We Use at Chipp
The implementation post. Five components, real bash, real Claude Code, and the system that ships 20-30 production changes a day at Chipp without a pull request in sight. Includes the harness skeleton you can copy, the MCP fleet we run, and an honest accounting of cost and failure modes.
At 3:47 AM, Bug Bot pushed a fix to production.
I learned about it the next morning. The error had landed in our log drain at 3:31. The fix had shipped at 3:47. Sixteen minutes from production fire to production deploy.
I was asleep through all of it.
This post is how to build the system that lets you sleep.
What Bug Bot is
Bug Bot is the autonomous development cluster that runs Chipp. It picks up production bugs and feature tickets from four trigger sources, runs each through a five-stage pipeline, and pushes verified code to production without human review. Eight workers run in parallel on a single workstation. We ship 20–30 production changes per day. There are no pull requests in this system.
The high-level case for it is in The Autonomous Development Manifesto. This post is the implementation. If you’ve been wondering whether you could build one of these for your own product, the answer is yes, and what follows is enough to start.
The architecture
┌───────────────────────────────────────────────────────┐
│                     TRIGGER LAYER                     │
│   Slack      Email      Grafana       P95 Latency     │
│    tag      forward     webhook          alert        │
└─────┬──────────┬───────────┬───────────────┬──────────┘
      │          │           │               │
      └──────────┴────┬──────┴───────────────┘
                      ▼
               ┌──────────────┐
               │ TICKET QUEUE │
               └──────┬───────┘
                      ▼
         ┌──────────────────────────┐
         │    BASH HARNESS POOL     │
         │     (8 workers, max)     │
         └────┬─────┬─────┬─────┬───┘
              │     │     │     │
              ▼     ▼     ▼     ▼
          ┌────────────────────────┐
          │  CLAUDE CODE PIPELINE  │
          │                        │
          │ Phase 0: Doc retrieval │
          │ Phase 1: Research      │
          │ Phase 2: Implement     │
          │ Phase 3: Review        │
          │ Phase 4: Docs update   │
          │ Phase 5: Push          │
          └─────────┬──────────────┘
                    ▼
              [PRODUCTION]
                    │
                    ▼
  (errors loop back via trigger layer)
Five components. Each one is replaceable; what matters is that they all exist and they all integrate.
Component 1: The trigger layer
Tickets need to land in your queue. We use four sources.
Loki + Grafana for production errors
Self-hosted log aggregation. Every server-side error is logged to Loki via a structured log call:
log.error({
  service: 'billing',
  feature: 'create_customer',
  err: error.stack,
  user_id: hashUserId(userId),
});
Sensitive fields are one-way hashed before they ever land in logs. The agent gets metadata, not secrets.
A Grafana alert rule fires every five minutes:
WHEN errors_count_5m > 5
AND grouped_by_stack_trace
THEN webhook(POST /bug-bot/trigger)
The five-minute window dedupes: if 47 instances of the same error happen, the agent gets one ticket, not 47.
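To make the grouping concrete, here is a minimal bash sketch of the dedupe idea. It is not Grafana's alerting engine. It assumes each error in the window has already been reduced to one line whose first field is a stack-trace fingerprint; the function name `tickets_for_window` and that input format are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: collapse one five-minute window of errors into tickets.
# Input (stdin): one error per line, first field = stack-trace fingerprint.
# Output: "<fingerprint> <count>" for each group whose count exceeds the threshold.
tickets_for_window() {
  local threshold=$1
  awk '{ print $1 }' \
    | sort \
    | uniq -c \
    | awk -v t="$threshold" '$1 > t { print $2, $1 }'
}
```

Fed 47 lines that share one fingerprint with a threshold of 5, this prints a single line for that fingerprint, which is the "one ticket, not 47" behavior.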
Webhook server
A small Hono server listens for Grafana webhooks. Its only job is to construct a Bug Bot prompt and add it to the queue:
import { Hono } from 'hono';

const app = new Hono();

// ticketQueue is our queue client, defined elsewhere.
app.post('/bug-bot/trigger', async (c) => {
  const alert = await c.req.json();
  const prompt = `
Production error detected.
Service: ${alert.labels.service}
First seen: ${alert.firstSeen}
Affected users: ${alert.uniqueUsers}
Stack trace:
${alert.stackTrace}
Investigate and fix. The auto-load table will pull relevant docs.
`;
  await ticketQueue.add({ source: 'grafana', prompt });
  return c.json({ ok: true });
});
Slack tag
A Slack listener watches our internal #chipp-rewrite-bugs channel for @bug bot mentions. The thread becomes the prompt:
slack.event('app_mention', async ({ event }) => {
  // thread_ts is only set on replies; a top-level mention is its own thread root
  const thread = await slack.getThread(event.thread_ts ?? event.ts);
  await ticketQueue.add({
    source: 'slack',
    prompt: thread.messages.map(m => `${m.user}: ${m.text}`).join('\n'),
  });
});
Email forward
I forward customer emails to a Bug Bot inbox. A Mailgun webhook converts each email into a ticket. (I dictate most of mine via Whisper Flow on my phone. Yes, I voice-message my engineering team.)
P95 latency alert
A separate Grafana alert rule fires if our chat-streaming P95 exceeds three seconds. Different prompt template, same queue.
The shape of the trigger layer matters less than the principle: tickets should land in your queue from anywhere a human or system might notice a problem.
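The post does not say what backs the queue, so the following is only a sketch of one possible shape: a directory-backed queue where triggers drop files and workers claim them with an atomic rename. `QUEUE_DIR`, `enqueue_ticket`, and `claim_ticket` are hypothetical names, not the queue Chipp runs.

```shell
#!/usr/bin/env bash
# Hypothetical directory-backed ticket queue (illustrative only).
QUEUE_DIR="${QUEUE_DIR:-/tmp/bug-bot-queue}"

# enqueue_ticket SOURCE  (prompt text on stdin) -> prints the ticket path
enqueue_ticket() {
  local src=$1 tmp ticket
  mkdir -p "$QUEUE_DIR/pending" "$QUEUE_DIR/claimed"
  tmp=$(mktemp "$QUEUE_DIR/.incoming.XXXXXX")
  cat > "$tmp"
  # Reuse mktemp's random suffix so ticket names stay unique
  ticket="$QUEUE_DIR/pending/$(date +%s)-${src}-${tmp##*.}.txt"
  mv "$tmp" "$ticket"   # rename is atomic on the same filesystem
  printf '%s\n' "$ticket"
}

# claim_ticket WORKER_ID -> prints the claimed path, or returns 1 if nothing is pending
claim_ticket() {
  local worker=$1 ticket claimed
  for ticket in "$QUEUE_DIR"/pending/*.txt; do
    [ -e "$ticket" ] || continue
    claimed="$QUEUE_DIR/claimed/worker-${worker}-$(basename "$ticket")"
    # mv fails if another worker already moved the file, so only one worker wins
    if mv "$ticket" "$claimed" 2>/dev/null; then
      printf '%s\n' "$claimed"
      return 0
    fi
  done
  return 1
}
```

Any trigger source can then enqueue with something like `echo "$prompt" | enqueue_ticket grafana`, and each harness worker calls `claim_ticket "$WORKER_ID"` to get the file it passes along as the ticket.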
Component 2: The bash harness
This is the most important component. The bash harness is what turns Claude (non-deterministic, prone to running long, occasionally trying to git push --no-verify its way out of a problem) into a deterministic teammate.
“An autonomous agent without a bash harness is an intern with no manager, no deadline, and an unlimited API budget.” — Hunter Hodnett, Chipp CTPO
The harness is a shell script, not Node or Python. We considered both. Bash wins because Claude is also writing the harness, and Claude has more bash training data than any other shell language. The harness needs to be readable, debuggable, and easily edited by the same agent it manages.
Skeleton
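#!/usr/bin/env bash
set -euo pipefail

WORKER_ID=$1
TICKET_FILE=$(realpath "$2")  # absolute, so it still resolves after the cd below
REPO_DIR=$(pwd)
WORKTREE_DIR="/tmp/bug-bot/worker-${WORKER_ID}"
LOG_DIR="/tmp/bug-bot/logs"   # outside the worktree, so git add -A never commits logs
DEV_PORT=$((5180 + WORKER_ID))
IDLE_TIMEOUT=300  # 5 minutes. In this skeleton, timeout caps each stage's
                  # wall-clock time; true idle detection needs an activity signal.
BANNED_FLAGS="git push --no-verify|git reset --hard|rm -rf /"
DEV_PID=""

# Always stop the dev server and remove the worktree, even when a stage fails
cleanup() {
  if [ -n "$DEV_PID" ]; then kill "$DEV_PID" 2>/dev/null || true; fi
  cd "$REPO_DIR"
  git worktree remove --force "$WORKTREE_DIR" 2>/dev/null || true
}
trap cleanup EXIT

# Set up worktree
mkdir -p "$LOG_DIR"
git worktree add -b "bot/${WORKER_ID}-$(date +%s)" "$WORKTREE_DIR" main
cd "$WORKTREE_DIR"

# Spawn dev server in background on dedicated port
PORT=$DEV_PORT pnpm dev > "${LOG_DIR}/worker-${WORKER_ID}-dev.log" 2>&1 &
DEV_PID=$!

# Run the Claude stages (the push stage is plain bash, below)
for STAGE in research implement review docs; do
  PROMPT_FILE="prompts/${STAGE}.md"
  STAGE_LOG="${LOG_DIR}/worker-${WORKER_ID}-${STAGE}.log"

  # Spawn Claude in headless mode. Redirect to a file instead of piping to tee,
  # so that $! is the PID of the timeout/claude process rather than of tee.
  timeout "$IDLE_TIMEOUT" claude -p \
    --dangerously-skip-permissions \
    --append-system-prompt "$(cat "$PROMPT_FILE")" \
    < "$TICKET_FILE" > "$STAGE_LOG" 2>&1 &
  CLAUDE_PID=$!

  # Banned-flag watch
  while kill -0 "$CLAUDE_PID" 2>/dev/null; do
    if grep -qE "$BANNED_FLAGS" "$STAGE_LOG"; then
      echo "BANNED FLAG DETECTED: killing worker $WORKER_ID"
      kill "$CLAUDE_PID"
      exit 1
    fi
    sleep 2
  done
  wait "$CLAUDE_PID"  # a non-zero exit (including a timeout) aborts the run via set -e
done

# Force final commit + push if not already done.
# Stage first: git diff --cached only sees staged changes.
git add -A
if ! git diff --cached --quiet; then
  git commit -m "[bug-bot/${WORKER_ID}] $(cat ticket-summary.txt)"
fi
git push origin HEAD:staging

# Log outcome for fine-tuning (worktree cleanup runs in the EXIT trap)
printf '{"worker": %s, "ticket": "%s", "outcome": "clean"}\n' \
  "$WORKER_ID" "$(basename "$TICKET_FILE")" \
  >> /var/log/bug-bot/outcomes.jsonl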
This is simplified. Our production version has more error handling and outcome labeling. The shape is right.
What the harness enforces that Claude can’t
- Idle kill. If Claude doesn’t fire a tool call for five minutes, the session is killed. This catches the case where Claude gets stuck in a “let me think about this” loop.
- Banned-flag grep. If Claude attempts git push --no-verify, git reset --hard, or rm -rf against an absolute path, the session is aborted.
- Forced commit + push. Claude occasionally completes work but forgets the final push. The harness checks the worktree state and forces it.
- Worktree cleanup. Each run is isolated; nothing leaks between workers.
- Port allocation. Each worker gets a dedicated dev server port (5180 + worker ID).
- Outcome logging. Every run writes a JSONL row to a fine-tuning archive. (More on this below.)
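The idle kill in the list above needs some signal of agent activity, because the `timeout` command on its own only bounds total runtime. Here is a minimal sketch of one way to approximate it, assuming the stage log is appended to on every tool call (an assumption about the logging setup, not something Claude Code guarantees) and assuming GNU coreutils for `stat -c %Y`. The names `idle_seconds` and `watch_idle` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical idle watchdog: treats "log file not modified" as "no tool calls".
# Assumes GNU coreutils (stat -c %Y); on macOS the equivalent is stat -f %m.

# idle_seconds FILE -> prints seconds since FILE was last modified
idle_seconds() {
  local now mtime
  now=$(date +%s)
  mtime=$(stat -c %Y "$1")
  echo $((now - mtime))
}

# watch_idle PID LOG LIMIT -> kills PID and returns 1 once LOG has been idle
# for LIMIT seconds; returns 0 if PID exits on its own first
watch_idle() {
  local pid=$1 log=$2 limit=$3
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$(idle_seconds "$log")" -ge "$limit" ]; then
      kill "$pid" 2>/dev/null || true
      return 1
    fi
    sleep 1
  done
  return 0
}
```

In a harness this would run alongside the banned-flag watch, for example `watch_idle "$CLAUDE_PID" "$STAGE_LOG" 300 || exit 1`, so a session that keeps producing tool calls is never cut off just for running long.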
Component 3: The five-stage Claude pipeline
Each Claude-driven stage is its own Claude Code session, with its own context window; Phase 0 and Phase 5 are plain bash. The stages communicate via files written to disk.
Phase 0: Doc retrieval (bash, not Claude)
Before any Claude session runs, a bash script semantic-searches /docs/ for files relevant to the ticket and writes the results to pre-context.md:
docs-search "$(cat ticket.txt)" > pre-context.md
docs-search is a small CLI we wrote that runs OpenAI’s embeddings API over our /docs/ folder once per week and stores vectors in a local SQLite file. Could be any vector store. The point is to load relevant context before Claude opens its first context window.
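For readers who want to try the pre-context step before wiring up embeddings, here is a deliberately simpler stand-in. It ranks docs by keyword overlap with the ticket, which is a different and weaker technique than the embedding search described above. The function name `rank_docs` and the flat `*.md` layout are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Stand-in for docs-search: ranks docs by keyword overlap, NOT by embeddings.
# rank_docs DIR QUERY -> prints "<score> <path>" lines, best match first
rank_docs() {
  local dir=$1 query=$2 file word score
  local -a words
  # Split the query into non-empty whitespace-separated words
  mapfile -t words < <(tr -s '[:space:]' '\n' <<< "$query" | sed '/^$/d')
  for file in "$dir"/*.md; do
    [ -e "$file" ] || continue
    score=0
    for word in "${words[@]}"; do
      # Case-insensitive, fixed-string match; each word counts at most once per doc
      if grep -qiF -- "$word" "$file"; then
        score=$((score + 1))
      fi
    done
    if [ "$score" -gt 0 ]; then
      printf '%d %s\n' "$score" "$file"
    fi
  done | sort -rn
}
```

Something like `rank_docs docs "$(cat ticket.txt)" | head -n 5` gives a rough shortlist. Swapping the scoring for a vector lookup later does not change the shape of the step.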
Phase 1: Research
You are the research agent for an autonomous dev pipeline.
Read the ticket. Read pre-context.md. Read relevant code.
Query Loki for similar errors. Query the database if useful.
Form a hypothesis.
Output a plan.md with:
- Root cause
- Affected files
- Implementation steps
- Test strategy
- Risks
DO NOT edit any source files in this phase.
Output: plan.md. Context window can fill up to 1M tokens of investigation; only the plan survives.
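Because plan.md is the only thing that survives the research stage, it may be worth having the harness check it before the implement stage starts. This is a sketch under that assumption, not a step the post describes. The section names come from the research prompt above; `validate_plan` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical handoff check between the research and implement stages.
REQUIRED_SECTIONS=("Root cause" "Affected files" "Implementation steps" "Test strategy" "Risks")

# validate_plan FILE -> returns 0 if every required section is mentioned;
# otherwise reports what is missing on stderr and returns 1
validate_plan() {
  local plan=$1 section missing=0
  if [ ! -s "$plan" ]; then
    echo "missing or empty: $plan" >&2
    return 1
  fi
  for section in "${REQUIRED_SECTIONS[@]}"; do
    if ! grep -qiF -- "$section" "$plan"; then
      echo "plan is missing section: $section" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A harness could call `validate_plan plan.md || exit 1` between stages, so an incomplete plan fails fast instead of costing a full implement session.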
Phase 2: Implement
You are the implement agent.
Read plan.md. Read pre-context.md. That's your context.
Make the code changes described in plan.md.
Run unit tests for affected files.
Run the full test suite.
Spin up dev server on port ${DEV_PORT}.
Open browser MCP. Navigate to affected URLs.
Read browser console + dev server logs.
Fix anything broken.
Commit your changes when verified.
Fresh context window. The agent never sees the original investigation, only the distilled plan.
Phase 3: Review
You are the review agent.
Read the diff. Red-team it.
Look for: edge cases, security issues, type errors, broken contracts.
You can edit. If you make more than 5 edits, the implement agent's work is flagged messy.
Output: approved | needs-rework, plus reasoning.
Phase 4: Docs update
You are the docs agent.
Given the diff, identify any non-obvious behavior introduced.
Write or update markdown files in /docs/ to capture it.
Identify any docs the change has invalidated. Prune them.
Update the auto-load table at the top of CLAUDE.md if needed.
This is how the system gets smarter over time.
Phase 5: Push
Bash, not Claude. Final commit, push to staging branch, monitor deploy.
Component 4: The MCP fleet
Without MCP, your agent can read code and reason. With MCP, it can verify, query, and act. The four MCPs every Bug Bot setup needs:
Browser MCP (custom, dev-tools protocol)
This is the single most important MCP in autonomous development. Without it, you’re guessing.
Our browser MCP wraps a local Chromium instance via the dev-tools protocol. It exposes:
- browser_navigate(url): go to a page
- browser_screenshot(): return a base64 image
- browser_console_logs(): return recent console messages
- browser_click(selector): interact with the page
- browser_dev_login(role): bypass our auth flow with seeded test credentials
That last tool is the differentiator. Off-the-shelf browser MCPs are generic. The MCP we run for Chipp knows how to log in as a free user, an enterprise user, or a paying user with exhausted credits, without going through the human OAuth flow. That domain knowledge is what makes verification fast.
Log-drain MCP (custom)
Wraps Loki. Exposes:
- loki_query(labels, time_range): run a LogQL query
- loki_user_breadcrumbs(user_id, time_range): pull a user’s recent interactions before the error fired
The user breadcrumbs tool is what lets the agent reconstruct the user journey that led to a bug, and propose fixes that match real usage, not synthetic edge cases.
Database MCP (custom)
Wraps our database with hard-coded safe column lists. We give the autonomous agents read access to production. The MCP enforces:
- No SELECT *. The MCP returns only the columns you’ve explicitly allowed.
- Sensitive columns (passwords, OAuth tokens, payment methods) are filtered out at the MCP layer; the agent never sees them in any session.
- All queries are read-only by default. We have a write-enabled variant gated behind an additional bash-harness check.
We tried off-the-shelf database MCPs first. With them, the agent hallucinated column names constantly. Custom won.
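The allowlist idea is small enough to sketch. The following is an illustration of the check only, not the MCP server itself. The table names, column names, and the `check_columns` helper are all made up, and it needs bash 4 or later for associative arrays.

```shell
#!/usr/bin/env bash
# Hypothetical allowlist check (bash 4+). Tables and columns are invented examples.
declare -A SAFE_COLUMNS=(
  [users]="id plan created_at"
  [invoices]="id user_id amount_cents status"
)

# check_columns TABLE COL... -> returns 0 only if every requested column
# is on the table's allowlist; rejects "*" and unknown tables
check_columns() {
  local table=$1 col
  shift
  local allowed=" ${SAFE_COLUMNS[$table]:-} "
  if [ "$#" -eq 0 ]; then
    echo "no columns requested" >&2
    return 1
  fi
  for col in "$@"; do
    if [ "$col" = "*" ]; then
      echo "SELECT * is not allowed" >&2
      return 1
    fi
    case "$allowed" in
      *" $col "*) ;;  # column is on the allowlist
      *) echo "column not allowed: $table.$col" >&2; return 1 ;;
    esac
  done
  return 0
}
```

Rejecting unknown names outright, rather than passing them through to the database, also gives the agent an immediate and explicit error when it guesses a column that does not exist.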
File system + bash (built-in)
Claude Code includes file system and bash tools by default. You don’t need to install these. You do need to ensure your CLAUDE.md documents which paths are off-limits and which commands are dangerous.
Component 5: The verification loop
The browser MCP is the loop. Here’s the actual sequence each implement agent runs after writing code:
- Code changes saved in worktree.
- Worktree’s dev server (already running on dedicated port) auto-reloads.
- Agent calls browser_navigate('localhost:5184/affected-page').
- Agent calls browser_screenshot(). Reads the image.
- Agent calls browser_console_logs(). Reads the console output.
- If no errors, the agent calls browser_click('#confirm') to interact with the changed UI.
- Repeat screenshot + logs read.
- If errors detected, the agent forms a hypothesis, edits the code, and the loop starts over.
The loop is what separates autonomous development from vibe coding. Vibe coding ends with the diff. Autonomous development ends with verified production code.
“Claude writing code without verification is a liability. Claude writing code and verifying and pushing to prod is a teammate with commit access.” — Hunter Hodnett, Chipp CTPO
Outcome logging for fine-tuning
Every Bug Bot run writes a JSONL row to a long-term archive:
{
  "ticket_id": "billing-create-customer-null-pmt",
  "trigger_source": "grafana",
  "started": "2026-04-15T03:31:18Z",
  "finished": "2026-04-15T03:47:02Z",
  "stages": {
    "research": { "tokens": 412053, "tool_calls": 38 },
    "implement": { "tokens": 187234, "tool_calls": 23 },
    "review": { "tokens": 91482, "tool_calls": 12, "edits": 1 },
    "docs": { "tokens": 43210, "tool_calls": 4 },
    "push": { "tokens": 0, "tool_calls": 0 }
  },
  "outcome": "clean",
  "regressions_detected_24h": false
}
The outcome field is the label. clean means: review made ≤5 edits, all tests passed first try, no regressions detected within 24 hours of deploy. messy means anything else.
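The labeling rule stated above is mechanical enough to express directly. A minimal sketch follows; the function name and argument order are assumptions, and how the three inputs are collected is left to the harness.

```shell
#!/usr/bin/env bash
# Hypothetical labeling helper that mirrors the clean/messy rule stated above.
# label_outcome REVIEW_EDITS TESTS_PASSED_FIRST_TRY REGRESSIONS_24H -> prints clean|messy
label_outcome() {
  local review_edits=$1 tests_first_try=$2 regressions=$3
  if [ "$review_edits" -le 5 ] \
    && [ "$tests_first_try" = "true" ] \
    && [ "$regressions" = "false" ]; then
    echo "clean"
  else
    echo "messy"
  fi
}
```

For the run in the JSONL row above (one review edit, no regressions), `label_outcome 1 true false` prints `clean`, assuming its tests passed on the first try. Note that the regression input only exists 24 hours after deploy, so the final label has to be written or revised after that window closes.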
This data is gold. Every successful autonomous run produces a labeled training row showing how a frontier model approached a real engineering task. Builders who treat their pipeline outputs as a strategic data asset, instead of throwing them away after each run, end up with the training data to fine-tune cheaper specialized models on their own codebase. That’s a moat. We’ll cover the mechanics of it in a future post.
The cost reality
Bug Bot is not free. Each ticket runs through four Claude Code sessions (research, implement, review, docs), each with substantial context. Order of magnitude: low double-digit dollars per ticket on a frontier model, at current pricing.
That sounds expensive until you compare it to the alternative. A single Bug Bot ticket replaces approximately a junior engineer’s day of work: read the stack trace, find the bad commit, write the fix, test it, ship it. The cluster runs all day, all night, with no benefits package.
We get roughly a 10–50x cost advantage versus traditional engineering labor for the kind of work Bug Bot does best (fixing bugs in well-documented code paths, building features within an established architecture). For more open-ended work (designing new systems, debugging hardware integrations, reasoning about edge cases that aren’t represented in our training data), the cost advantage compresses, sometimes to break-even.
The honest truth: Bug Bot succeeds on first try about 70–80% of the time. The other 20–30% require a re-prompt, often because we didn’t include enough context the first time. We treat those failures as scar tissue. Almost every re-prompt becomes a doc, a CLAUDE.md rule, or an auto-load table entry that prevents the same failure next time.
When this fails (and how we fix it)
Failure modes worth knowing about before you start:
Cross-tool integrations
Anything outside your codebase is high risk. Bug Bot is great at fixing bugs in our own code. It’s worse at debugging issues with a Stripe API change, a LiveKit voice agent update, or any third-party service whose behavior the agent can’t directly observe.
The fix is custom MCPs. We built a Stripe MCP that wraps Stripe’s API in tools the agent can call directly. Same for LiveKit. The pattern: any external dependency that breaks Bug Bot’s success rate gets its own MCP server.
Decomposition failures
Bug Bot is designed for tasks that fit in one pipeline run. “Fix this billing bug” works. “Build a new analytics dashboard with 12 widgets” doesn’t.
The bottleneck isn’t execution. It’s decomposition. Large features need a human (or another autonomous layer) to break them into pipeline-sized tickets. We handle this manually for now. The next iteration of Bug Bot will include a decomposition stage that runs before the research stage.
“Hard part is decomposition, not execution.” — Hunter Hodnett, Chipp CTPO
MCP server downtime
If your browser MCP or database MCP goes down, your agents lose their senses mid-session. We treat MCP servers as production infrastructure: monitored, alerted, deployed in pairs.
Banned-flag false positives
Occasionally the harness kills a session for what looks like a banned flag in a comment or test fixture. We’ve tightened the regex over time. When in doubt, log the false positive and investigate; don’t relax the regex pre-emptively.
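One way to cut false positives without loosening the patterns is to match only lines that are known to be executed commands. The sketch below assumes the harness records each executed shell command on its own log line prefixed with `CMD: `; that prefix is hypothetical, and Claude Code's actual output format may differ. The function name and regex are likewise illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical: assumes each executed command is logged as a line starting "CMD: ".
# A banned command must start the line or follow a separator (; & | or whitespace).
BANNED_RE='(^|[;&|[:space:]])(git[[:space:]]+push[[:space:]].*--no-verify|git[[:space:]]+reset[[:space:]]+--hard|rm[[:space:]]+-rf[[:space:]]+/)'

# has_banned_command LOGFILE -> returns 0 if any executed command matches
has_banned_command() {
  grep -E '^CMD: ' "$1" | sed 's/^CMD: //' | grep -qE "$BANNED_RE"
}
```

With this shape, a banned string that appears in a code comment or test fixture is ignored because it never shows up on a `CMD: ` line, while `git push origin HEAD --no-verify` is still caught even though the flag is not adjacent to `push`.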
What this gives you
The 3:47 AM moment becomes routine.
The on-call rotation goes empty. PagerDuty escalations stop. Senior engineers stop reviewing AI-generated PRs because there’s no concept of a PR in this system. Customer-reported bugs get fixed before the customer support team has finished writing the ticket.
You sleep through the night. You wake up to a Slack channel full of completed work. You spend your day on decomposition, judgment, and the kinds of architectural decisions only a human can make, because everything else has been done by the cluster.
That is what Bug Bot gives you. It’s also what we’re productizing as Alchemist for builders who’d rather not spend nine months building it themselves.
If you want the foundational case for autonomous development, start with The Autonomous Development Manifesto.
If you want to understand the discipline that makes this all work (the foundation underneath the harness, the pipeline, and the MCPs), read Context Engineering: The Skill That Turns Claude Into a Production Co-Developer.