{
  "slug": "self-healing-bug-bot",
  "url": "https://adaas.dev/blog/self-healing-bug-bot",
  "formats": {
    "html": "https://adaas.dev/blog/self-healing-bug-bot",
    "markdown": "https://adaas.dev/blog/self-healing-bug-bot.md",
    "plaintext": "https://adaas.dev/blog/self-healing-bug-bot.txt",
    "json": "https://adaas.dev/blog/self-healing-bug-bot.json"
  },
  "title": "Building a Self-Healing Bug Bot: The Autonomous Dev System We Use at Chipp",
  "description": "The implementation post. Five components, real bash, real Claude Code, and the system that ships 20-30 production changes a day at Chipp without a pull request in sight. Includes the harness skeleton you can copy, the MCP fleet we run, and an honest accounting of cost and failure modes.",
  "publishedAt": "2026-05-01",
  "updatedAt": null,
  "author": "Hunter Hodnett",
  "authorRole": "Co-founder & CTPO, Chipp",
  "authorUrl": null,
  "category": "Engineering",
  "tags": [
    "bug-bot",
    "autonomous-coding",
    "agentic-workflows",
    "claude-code",
    "self-healing"
  ],
  "keywords": [
    "autonomous coding",
    "agentic workflows",
    "self-healing pipeline",
    "claude code agent teams",
    "bash harness",
    "bug bot"
  ],
  "coverImage": null,
  "readingMinutes": 14,
  "canonicalUrl": "https://adaas.dev/blog/self-healing-bug-bot",
  "bodyMarkdown": "At 3:47 AM, Bug Bot pushed a fix to production.\n\nI learned about it the next morning. The error had landed in our log drain at 3:31. The fix had shipped at 3:47. Sixteen minutes from production fire to production deploy.\n\nI was asleep through all of it.\n\nThis post is how to build the system that lets you sleep.\n\n## What Bug Bot is\n\nBug Bot is the autonomous development cluster that runs Chipp. It picks up production bugs and feature tickets from four trigger sources, runs each through a five-stage pipeline, and pushes verified code to production without human review. Eight workers run in parallel on a single workstation. We ship 20–30 production changes per day. There are no pull requests in this system.\n\nThe high-level case for it is in [The Autonomous Development Manifesto](/blog/autonomous-development). This post is the implementation. If you've been wondering whether you could build one of these for your own product, the answer is yes, and what follows is enough to start.\n\n## The architecture\n\n```\n   ┌───────────────────────────────────────────────────────┐\n   │                   TRIGGER LAYER                        │\n   │   Slack       Email       Grafana      P95 Latency    │\n   │    tag       forward      webhook         alert       │\n   └─────┬──────────┬───────────┬───────────────┬──────────┘\n         │          │           │               │\n         └──────────┴────┬──────┴───────────────┘\n                         ▼\n                  ┌──────────────┐\n                  │ TICKET QUEUE │\n                  └──────┬───────┘\n                         ▼\n            ┌──────────────────────────┐\n            │   BASH HARNESS POOL       │\n            │   (8 workers, max)        │\n            └────┬─────┬─────┬─────┬───┘\n                 │     │     │     │\n                 ▼     ▼     ▼     ▼\n            ┌────────────────────────┐\n            │  CLAUDE CODE PIPELINE  │\n            │                        │\n    
        │ Phase 0: Doc retrieval │\n            │ Phase 1: Research      │\n            │ Phase 2: Implement     │\n            │ Phase 3: Review        │\n            │ Phase 4: Docs update   │\n            │ Phase 5: Push          │\n            └─────────┬──────────────┘\n                      ▼\n                  [PRODUCTION]\n                      │\n                      ▼\n        (errors loop back via trigger layer)\n```\n\nFive components. Each one is replaceable; what matters is that they all exist and they all integrate.\n\n## Component 1: The trigger layer\n\nTickets need to land in your queue. We use four sources.\n\n### Loki + Grafana for production errors\n\nSelf-hosted log aggregation. Every server-side error is logged to Loki via a structured log call:\n\n```typescript\nlog.error({\n  service: 'billing',\n  feature: 'create_customer',\n  err: error.stack,\n  user_id: hashUserId(userId),\n});\n```\n\nSensitive fields are one-way hashed before they ever land in logs. The agent gets metadata, not secrets.\n\nA Grafana alert rule is evaluated every five minutes:\n\n```\nWHEN errors_count_5m > 5\n  AND grouped_by_stack_trace\nTHEN webhook(POST /bug-bot/trigger)\n```\n\nThe five-minute window dedupes: if 47 instances of the same error occur, the agent gets one ticket, not 47.\n\n### Webhook server\n\nA small Hono server listens for Grafana webhooks. Its only job is to construct a Bug Bot prompt and add it to the queue:\n\n```typescript\napp.post('/bug-bot/trigger', async (c) => {\n  const alert = await c.req.json();\n  const prompt = `\nProduction error detected.\n\nService: ${alert.labels.service}\nFirst seen: ${alert.firstSeen}\nAffected users: ${alert.uniqueUsers}\nStack trace:\n${alert.stackTrace}\n\nInvestigate and fix. 
The auto-load table will pull relevant docs.\n`;\n  await ticketQueue.add({ source: 'grafana', prompt });\n  return c.json({ ok: true });\n});\n```\n\n### Slack tag\n\nA Slack listener watches our internal `#chipp-rewrite-bugs` channel for `@bug bot` mentions. The thread becomes the prompt:\n\n```typescript\nslack.event('app_mention', async ({ event }) => {\n  // Fall back to the message itself when the mention is not inside a thread\n  const thread = await slack.getThread(event.thread_ts ?? event.ts);\n  await ticketQueue.add({\n    source: 'slack',\n    prompt: thread.messages.map(m => `${m.user}: ${m.text}`).join('\\n'),\n  });\n});\n```\n\n### Email forward\n\nI forward customer emails to a Bug Bot inbox. A Mailgun webhook converts each email into a ticket. (I dictate most of mine via Whisper Flow on my phone; yes, I voice-message my engineering team.)\n\n### P95 latency alert\n\nA separate Grafana alert rule fires if our chat-streaming P95 exceeds three seconds. Different prompt template, same queue.\n\nThe shape of the trigger layer matters less than the principle: tickets should land in your queue from anywhere a human or system might notice a problem.\n\n## Component 2: The bash harness\n\nThis is the most important component. The bash harness is what turns Claude (non-deterministic, prone to running long, occasionally trying to `git push --no-verify` its way out of a problem) into a deterministic teammate.\n\n> \"An autonomous agent without a bash harness is an intern with no manager, no deadline, and an unlimited API budget.\"\n> — Hunter Hodnett, Chipp CTPO\n\nThe harness is a shell script, not Node or Python. We've considered both. Bash wins because Claude is also writing the harness, and Claude has more training data for bash than for any other shell language. 
The harness needs to be readable, debuggable, and easily edited by the same agent it manages.\n\n### Skeleton\n\n```bash\n#!/usr/bin/env bash\nset -euo pipefail\n\nWORKER_ID=$1\nTICKET_FILE=$(realpath \"$2\")  # absolute path, so it survives the cd below\nREPO_DIR=$(pwd)\nWORKTREE_DIR=\"/tmp/bug-bot/worker-${WORKER_ID}\"\nDEV_PORT=$((5180 + WORKER_ID))\nIDLE_TIMEOUT=300  # 5 minutes\nBANNED_FLAGS=\"git push --no-verify|git reset --hard|rm -rf /\"\nDEV_PID=\"\"\n\n# Always stop the dev server and remove the worktree, even on failure\ncleanup() {\n  [ -n \"$DEV_PID\" ] && kill \"$DEV_PID\" 2>/dev/null || true\n  git -C \"$REPO_DIR\" worktree remove --force \"$WORKTREE_DIR\" 2>/dev/null || true\n}\ntrap cleanup EXIT\n\n# Set up worktree\ngit worktree add -b \"bot/${WORKER_ID}-$(date +%s)\" \"$WORKTREE_DIR\" main\ncd \"$WORKTREE_DIR\"\n\n# Spawn dev server in background on dedicated port\nPORT=$DEV_PORT pnpm dev > \"/tmp/bug-bot/worker-${WORKER_ID}.log\" 2>&1 &\nDEV_PID=$!\n\n# Run the 5-stage pipeline\nfor STAGE in research implement review docs push; do\n  PROMPT_FILE=\"prompts/${STAGE}.md\"\n\n  # Spawn Claude in headless mode. Output is redirected to the log instead of\n  # piped through tee, so that $! is the timeout/claude process and not tee.\n  # Simplification: timeout caps wall-clock time per stage; a true idle kill\n  # would watch for tool-call activity instead.\n  timeout \"$IDLE_TIMEOUT\" claude -p \\\n    --dangerously-skip-permissions \\\n    --append-system-prompt \"$(cat \"$PROMPT_FILE\")\" \\\n    < \"$TICKET_FILE\" > \"stage-${STAGE}.log\" 2>&1 &\n  CLAUDE_PID=$!\n\n  # Banned-flag watch\n  while kill -0 \"$CLAUDE_PID\" 2>/dev/null; do\n    if grep -qE \"$BANNED_FLAGS\" \"stage-${STAGE}.log\"; then\n      echo \"BANNED FLAG DETECTED: killing worker $WORKER_ID\"\n      kill \"$CLAUDE_PID\"\n      exit 1\n    fi\n    sleep 2\n  done\n\n  # A non-zero exit (including a timeout) aborts the run via set -e\n  wait \"$CLAUDE_PID\"\ndone\n\n# Force final commit + push if not already done.\n# Stage first, so unstaged and untracked changes are detected too.\ngit add -A\nif ! git diff --cached --quiet; then\n  git commit -m \"[bug-bot/${WORKER_ID}] $(cat ticket-summary.txt)\"\nfi\ngit push origin HEAD:staging\n\n# Log outcome for fine-tuning (the EXIT trap handles cleanup)\necho \"{\\\"worker\\\": ${WORKER_ID}, \\\"ticket\\\": \\\"$(basename \"$TICKET_FILE\")\\\", \\\"outcome\\\": \\\"clean\\\"}\" \\\n  >> /var/log/bug-bot/outcomes.jsonl\n```\n\nThis is simplified. Our production version has more error handling and outcome labeling. 
The shape is right.\n\n### What the harness enforces that Claude can't\n\n- **Idle kill.** If Claude doesn't fire a tool call for five minutes, the session is killed. This catches the case where Claude gets stuck in a \"let me think about this\" loop.\n- **Banned-flag grep.** If Claude attempts `git push --no-verify`, `git reset --hard`, or `rm -rf` against an absolute path, the session is aborted.\n- **Forced commit + push.** Claude occasionally completes work but forgets the final push. The harness checks the worktree state and forces it.\n- **Worktree cleanup.** Each run is isolated; nothing leaks between workers.\n- **Port allocation.** Each worker gets a dedicated dev server port (5180 + worker ID).\n- **Outcome logging.** Every run writes a JSONL row to a fine-tuning archive. (More on this below.)\n\n## Component 3: The five-stage Claude pipeline\n\nEach stage is its own Claude Code session, with its own context window. The stages communicate via files written to disk.\n\n### Phase 0: Doc retrieval (bash, not Claude)\n\nBefore any Claude session runs, a bash script semantic-searches `/docs/` for files relevant to the ticket and writes the results to `pre-context.md`:\n\n```bash\ndocs-search \"$(cat ticket.txt)\" > pre-context.md\n```\n\n`docs-search` is a small CLI we wrote that runs OpenAI's embeddings API over our `/docs/` folder once per week and stores vectors in a local SQLite file. Could be any vector store. The point is to load relevant context before Claude opens its first context window.\n\n### Phase 1: Research\n\n```\nYou are the research agent for an autonomous dev pipeline.\n\nRead the ticket. Read pre-context.md. Read relevant code.\nQuery Loki for similar errors. Query the database if useful.\nForm a hypothesis.\n\nOutput a plan.md with:\n- Root cause\n- Affected files\n- Implementation steps\n- Test strategy\n- Risks\n\nDO NOT edit any source files in this phase.\n```\n\nOutput: `plan.md`. 
Context window can fill up to 1M tokens of investigation; only the plan survives.\n\n### Phase 2: Implement\n\n```\nYou are the implement agent.\n\nRead plan.md. Read pre-context.md. That's your context.\nMake the code changes described in plan.md.\nRun unit tests for affected files.\nRun the full test suite.\nSpin up dev server on port ${DEV_PORT}.\nOpen browser MCP. Navigate to affected URLs.\nRead browser console + dev server logs.\nFix anything broken.\n\nCommit your changes when verified.\n```\n\nFresh context window. The agent never sees the original investigation, only the distilled plan.\n\n### Phase 3: Review\n\n```\nYou are the review agent.\n\nRead the diff. Red-team it.\nLook for: edge cases, security issues, type errors, broken contracts.\nYou can edit. If you make more than 5 edits, the implement agent's work is flagged messy.\n\nOutput: approved | needs-rework, plus reasoning.\n```\n\n### Phase 4: Docs update\n\n```\nYou are the docs agent.\n\nGiven the diff, identify any non-obvious behavior introduced.\nWrite or update markdown files in /docs/ to capture it.\nIdentify any docs the change has invalidated. Prune them.\nUpdate the auto-load table at the top of CLAUDE.md if needed.\n```\n\nThis is how the system gets smarter over time.\n\n### Phase 5: Push\n\nBash, not Claude. Final commit, push to staging branch, monitor deploy.\n\n## Component 4: The MCP fleet\n\nWithout MCP, your agent can read code and reason. With MCP, it can verify, query, and act. The four MCPs every Bug Bot setup needs:\n\n### Browser MCP (custom, dev-tools protocol)\n\nThis is the single most important MCP in autonomous development. Without it, you're guessing.\n\nOur browser MCP wraps a local Chromium instance via the dev-tools protocol. 
It exposes:\n\n- `browser_navigate(url)`, go to a page\n- `browser_screenshot()`, return a base64 image\n- `browser_console_logs()`, return recent console messages\n- `browser_click(selector)`, interact with the page\n- `browser_dev_login(role)`, bypass our auth flow with seeded test credentials\n\nThat last tool is the differentiator. Off-the-shelf browser MCPs are generic. The MCP we run for Chipp knows how to log in as a free user, an enterprise user, or a paying user with exhausted credits, without going through the human OAuth flow. That domain knowledge is what makes verification fast.\n\n### Log-drain MCP (custom)\n\nWraps Loki. Exposes:\n\n- `loki_query(labels, time_range)`, run a LogQL query\n- `loki_user_breadcrumbs(user_id, time_range)`, pull a user's recent interactions before the error fired\n\nThe user breadcrumbs tool is what lets the agent reconstruct the user journey that led to a bug, and propose fixes that match real usage, not synthetic edge cases.\n\n### Database MCP (custom)\n\nWraps our database with hard-coded safe column lists. We give the autonomous agents read access to production. The MCP enforces:\n\n- No `SELECT *`. The MCP returns only the columns you've explicitly allowed.\n- Sensitive columns (passwords, OAuth tokens, payment methods) are filtered out at the MCP layer; the agent never sees them in any session.\n- All queries are read-only by default. We have a write-enabled variant gated behind an additional bash-harness check.\n\nWe tried off-the-shelf database MCPs first. They hallucinated column names constantly. Custom won.\n\n### File system + bash (built-in)\n\nClaude Code includes file system and bash tools by default. You don't need to install these. You do need to ensure your `CLAUDE.md` documents which paths are off-limits and which commands are dangerous.\n\n## Component 5: The verification loop\n\nThe browser MCP is the loop. Here's the actual sequence each implement agent runs after writing code:\n\n1. 
Code changes saved in worktree.\n2. Worktree's dev server (already running on dedicated port) auto-reloads.\n3. Agent calls `browser_navigate('localhost:5184/affected-page')`.\n4. Agent calls `browser_screenshot()`. Reads the image.\n5. Agent calls `browser_console_logs()`. Reads the console output.\n6. If no errors, the agent calls `browser_click('#confirm')` to interact with the changed UI.\n7. Repeat screenshot + logs read.\n8. If errors detected, the agent forms a hypothesis, edits the code, and the loop starts over.\n\nThe loop is what separates autonomous development from vibe coding. Vibe coding ends with the diff. Autonomous development ends with verified production code.\n\n> \"Claude writing code without verification is a liability. Claude writing code and verifying and pushing to prod is a teammate with commit access.\"\n> — Hunter Hodnett, Chipp CTPO\n\n## Outcome logging for fine-tuning\n\nEvery Bug Bot run writes a JSONL row to a long-term archive:\n\n```json\n{\n  \"ticket_id\": \"billing-create-customer-null-pmt\",\n  \"trigger_source\": \"grafana\",\n  \"started\": \"2026-04-15T03:31:18Z\",\n  \"finished\": \"2026-04-15T03:47:02Z\",\n  \"stages\": {\n    \"research\": { \"tokens\": 412053, \"tool_calls\": 38 },\n    \"implement\": { \"tokens\": 187234, \"tool_calls\": 23 },\n    \"review\": { \"tokens\": 91482, \"tool_calls\": 12, \"edits\": 1 },\n    \"docs\": { \"tokens\": 43210, \"tool_calls\": 4 },\n    \"push\": { \"tokens\": 0, \"tool_calls\": 0 }\n  },\n  \"outcome\": \"clean\",\n  \"regressions_detected_24h\": false\n}\n```\n\nThe `outcome` field is the label. `clean` means: review made ≤5 edits, all tests passed first try, no regressions detected within 24 hours of deploy. `messy` means anything else.\n\nThis data is gold. Every successful autonomous run produces a labeled training row showing how a frontier model approached a real engineering task. 
Builders who treat their pipeline outputs as a strategic data asset, instead of throwing them away after each run, end up with the training data to fine-tune cheaper specialized models on their own codebase. That's a moat. We'll cover the mechanics of it in a future post.\n\n## The cost reality\n\nBug Bot is not free. Each ticket runs through five Claude Code sessions, each with substantial context. Order of magnitude: low double-digit dollars per ticket on a frontier model, at current pricing.\n\nThat sounds expensive until you compare it to the alternative. A single Bug Bot ticket replaces approximately a junior engineer's day of work: read the stack trace, find the bad commit, write the fix, test it, ship it. The cluster runs all day, all night, with no benefits package.\n\nWe get roughly a 10–50x cost advantage versus traditional engineering labor for the kind of work Bug Bot does best (fixing bugs in well-documented code paths, building features within an established architecture). For more open-ended work (designing new systems, debugging hardware integrations, reasoning about edge cases that aren't represented in our training data), the cost advantage compresses, sometimes to break-even.\n\nThe honest truth: Bug Bot succeeds on the first try about 70–80% of the time. The other 20–30% require a re-prompt, often because we didn't include enough context the first time. We treat those failures as scar tissue. Almost every re-prompt becomes a doc, a `CLAUDE.md` rule, or an auto-load table entry that prevents the same failure next time.\n\n## When this fails (and how we fix it)\n\nFailure modes worth knowing about before you start:\n\n### Cross-tool integrations\n\nAnything outside your codebase is high risk. Bug Bot is great at fixing bugs in our own code. It's worse at debugging issues with a Stripe API change, a LiveKit voice agent update, or any third-party service whose behavior the agent can't directly observe.\n\nThe fix is custom MCPs. 
We built a Stripe MCP that wraps Stripe's API in tools the agent can call directly. Same for LiveKit. The pattern: any external dependency that breaks Bug Bot's success rate gets its own MCP server.\n\n### Decomposition failures\n\nBug Bot is designed for tasks that fit in one pipeline run. *\"Fix this billing bug\"* works. *\"Build a new analytics dashboard with 12 widgets\"* doesn't.\n\nThe bottleneck isn't execution. It's decomposition. Large features need a human (or another autonomous layer) to break them into pipeline-sized tickets. We handle this manually for now. The next iteration of Bug Bot will include a decomposition stage that runs before the research stage.\n\n> \"Hard part is decomposition, not execution.\"\n> — Hunter Hodnett, Chipp CTPO\n\n### MCP server downtime\n\nIf your browser MCP or database MCP goes down, your agents lose their senses mid-session. We treat MCP servers as production infrastructure: monitored, alerted, deployed in pairs.\n\n### Banned-flag false positives\n\nOccasionally the harness kills a session for what looks like a banned flag in a comment or test fixture. We've tightened the regex over time. When in doubt, log the false positive and investigate; don't relax the regex pre-emptively.\n\n## What this gives you\n\nThe 3:47 AM moment becomes routine.\n\nThe on-call rotation goes empty. PagerDuty escalations stop. Senior engineers stop reviewing AI-generated PRs because there are no PRs. The PR queue empties because there's no concept of a PR in this system. Customer-reported bugs get fixed before the customer support team has finished writing the ticket.\n\nYou sleep through the night. You wake up to a Slack channel full of completed work. You spend your day on decomposition, judgment, and the kinds of architectural decisions only a human can make, because every other thing has been done by the cluster.\n\nThat is what Bug Bot gives you. 
It's also what we're productizing as Alchemist for builders who'd rather not spend nine months building it themselves.\n\n**[Join the Alchemist waitlist →](/#waitlist)**\n\n---\n\nIf you want the foundational case for autonomous development, start with [The Autonomous Development Manifesto](/blog/autonomous-development).\n\nIf you want to understand the discipline that makes this all work (the foundation underneath the harness, the pipeline, and the MCPs), read [Context Engineering: The Skill That Turns Claude Into a Production Co-Developer](/blog/context-engineering)."
}