{
  "version": "https://jsonfeed.org/version/1.1",
  "title": "Alchemist AI Blog",
  "home_page_url": "https://adaas.dev/blog",
  "feed_url": "https://adaas.dev/blog/feed.json",
  "description": "Field notes on autonomous software development from the Alchemist team.",
  "language": "en-US",
  "items": [
    {
      "id": "https://adaas.dev/blog/distillation-is-your-moat",
      "url": "https://adaas.dev/blog/distillation-is-your-moat",
      "title": "Distillation Is Your Moat",
      "summary": "Anthropic owns the best coding model. If you build your business on top of Claude (like we do) Anthropic eventually owns your margin. The defense is distillation: training your own smaller model on the outputs of the frontier. It's the most consequential AI policy debate of the decade, and almost nobody is paying attention.",
      "content_text": "I've spent the last six posts explaining how to build an autonomous coding cluster on top of Anthropic's models. This is the post about why that's a temporary arrangement.\n\nIf you're going to build a real business on top of agentic AI, you need to understand the platform risk you're taking. And you need to understand the move that mitigates it (*distillation*) because it's both the long-term moat for AI-native businesses and, increasingly, the most contested issue in AI policy. The AI labs are about to start lobbying to make distillation harder, slower, or illegal. The reasons they'll give will sound noble. The actual reasons are about preserving their monopolies. Pay attention to this fight.\n\nThis post covers what distillation is, why it matters, what we've learned doing it, and why the regulatory dimension is the part you should be loudest about.\n\n## The platform risk\n\nRight now, in early 2026, Anthropic has the best coding model. By a meaningful margin. It's not close.\n\nIf you're building an autonomous coding system, you use Claude. Cursor uses Claude under the hood for most of their work. Cognition (Devin) uses Claude. We use Claude at Chipp. The companies advertising \"Claude alternative\" or \"Gemini-powered\" coding agents are quietly using Claude when their customers care about output quality.\n\nThat dominance is real and it's a problem.\n\nRight now, Anthropic prices their Claude Max plans at $100-$200 per month with very generous usage caps. They subsidize this: every Claude Max session costs them more in compute than they're charging. They do this because they're capturing the market, training on the data, and improving their models faster than anyone else. They're playing the Uber-rides-cost-$3 phase of the platform game. We benefit.\n\nEventually that ends. Anthropic, or the company that buys them, has to stop subsidizing. Prices go up. Rate limits get tighter. 
Specific use cases (say, \"running thirty parallel agents on a single machine\") get reclassified as enterprise-tier and priced accordingly. We saw the first sign of this when Claude Code quietly moved from the Pro plan to the Max plan in April. That's not the last move.\n\nIf you've built your business on Claude, every move Anthropic makes is a move against your margin. Eventually they'll capture so much of the value you create that the only way out is to cut a deal. That's not a hypothesis. That's how every platform plays out, every time.\n\nThe defense is distillation.\n\n## What distillation is\n\nDistillation, in plain language: take the frontier model's outputs and use them as training data for a smaller, cheaper model that you own.\n\nIn a normal training run, you'd train a model from scratch on raw data: the entire internet, code from GitHub, documentation. That takes hundreds of millions of dollars and 18 months. Distillation skips most of that. You start with an open-source base model: a smaller, less-capable model someone else has already pre-trained at vast expense and released for free. Then you fine-tune that model on a curated dataset of high-quality input/output pairs. Each pair is a question or task, and the answer the frontier model gave to it.\n\nThe smaller model learns to mimic the frontier model's behavior on the specific kinds of tasks you fine-tune for. It doesn't get smarter at everything. But it gets *much* smarter at the slice you trained for. For a narrow, specialized use case (say, autonomous coding in your specific tech stack), a distilled model can hit 60-90% of the frontier model's quality, at a fraction of the inference cost, on hardware you control.\n\nThis is how DeepSeek matched OpenAI's GPT-class models for $5 million in training compute, against the billions OpenAI spent. They distilled. 
The Chinese open-source community is pioneering this technique, partly because it's an effective way to compete without matching American capital expenditure, and partly because they've correctly identified that the frontier-model arms race is a winner-takes-most game they can't win head-on.\n\n## What it costs to distill\n\nI've been distilling our autonomous coding cluster's outputs for the last few months. The honest scoreboard:\n\n- We've spent about $2,500 on training so far, across a few experimental runs.\n- Our best distilled model, based on Qwen 2.5 14B, performs at about 63% of Claude Opus 4.6's quality on our internal benchmarks.\n- Inference cost on the distilled model is roughly 1/100th of inference on Opus. About $0.04 per ticket vs $4 per ticket.\n\n63% sounds bad. It's not. Most of what an autonomous coding cluster does is not the frontier of model capability. It's the routine work. Reading code, writing CRUD, applying conventions, running tests. A 63% Opus model that costs 1% as much will handle 80% of the tickets. Frontier models can handle the remaining 20%.\n\nThe economics of \"cheap distilled model handles routine tickets, frontier model handles edge cases\" are very different from \"frontier model handles everything.\" The former is sustainable at scale. The latter is a bet that Anthropic stays charitable.\n\nThe distillation process itself is straightforward in shape and difficult in detail. You collect training data: every prompt your frontier model has answered, every tool call sequence, every successful output. You curate it (this is the hard part; bad data poisons the training). You pick a base model and a fine-tuning approach. You run the training. You evaluate. You ship.\n\nWe use Hugging Face for the model hosting and the training infrastructure. The base models we've experimented with most are Qwen (Alibaba's open-source family, currently the strongest open coding models) and Llama (Meta's open family). 
We use supervised fine-tuning (SFT), the simpler of the two main approaches. The other approach, preference- and reward-based methods like DPO and GRPO, requires far more data and is what DeepSeek used to match GPT. SFT is the entry point. The amateur approach. It works.\n\nThe training data is the asset. We've recorded every successful autonomous run on our cluster: every prompt, every tool call, every output, every outcome label (did this fix actually solve the issue?). After a few months, we have a dataset that's specific to our use case in a way no public dataset is. That's the moat.\n\n## Save your training data now\n\nHere's the operational lesson: **start saving your training data today, even if you're not training yet.**\n\nIf you're using Claude Code interactively, your chat history is on your machine. Move it to a stable location. Back it up. Tag the conversations by outcome: did the work ship? Did it break? Did you have to re-prompt? Outcome labels are what make training data valuable, and outcomes are easy to capture in the moment and impossible to reconstruct after the fact.\n\nIf you're building a product on Anthropic's API, log every request and response. Tag the requests by feature area, by use case, by outcome. Store the logs in a database, not a logfile. You will want to query them.\n\nIf you're operating a SaaS that uses LLMs, your customers' interactions with your AI features are training data. Make sure your terms of service give you the right to use that data for model improvement, and make sure the data is being captured in a usable format.\n\nThree years from now, when distillation is the obvious move, you'll either have years of high-quality training data or you'll have to start from scratch. The decision to capture the data is one you make now. The decision to train on it is one you can defer.\n\n## Why this fight is coming\n\nThe major AI labs do not want you to distill. Distillation breaks their business model. 
If everyone can take the outputs of the frontier model and train cheaper, comparable models for narrow use cases, the labs lose pricing power.\n\nTheir playbook, which is starting to surface in policy discussions, has three moves.\n\n**Move 1: redefine distillation as theft.** Rename it \"model output exfiltration\" or \"intellectual property circumvention.\" Argue that training one model on another model's outputs is a copyright violation. (The legal argument here is weak, since model outputs are not copyrightable in the same way human writing is, but the labs have a lot of lobbyists.)\n\n**Move 2: lobby for export controls.** Argue that distilled models are a national security risk because foreign adversaries can use the technique to catch up with American AI capability. Get a regulation passed that requires labs to add clauses to their terms of service prohibiting distillation, with criminal penalties for violation. (DeepSeek will be invoked. The implication will be that all distillation is geopolitically dangerous. The actual concern is monopoly maintenance.)\n\n**Move 3: technical countermeasures.** Watermark model outputs in ways that distilled models will inherit, then sue anyone whose model produces watermarked-output-style behavior. (Technically hard. The labs are trying anyway.)\n\nIf you care about a competitive AI ecosystem (and you should, because monopolies are what drive prices up and innovation down) distillation is the single most important policy issue to follow. Pay attention to who's introducing legislation. Pay attention to who's funding it. The \"AI safety\" framing will be heavy in the air. Most of it will be downstream of monopoly preservation.\n\n## The longer arc\n\nWhere I think this all heads, in five years:\n\nOpen-source models will be 80-90% as capable as frontier models on narrow tasks, at 1% of the inference cost. Specialized industries will run their own fine-tuned models on dedicated hardware. 
There are companies right now taking AI models and embedding them on hardware chips, and that's a real direction. The frontier labs will sell the very-hardest-task tier as a premium service, but the routine work will not be on their infrastructure.\n\nThat's the future where AI-native businesses are durable. The opposite future, where frontier labs own all inference, all data, and all margin, is the one the labs are working toward.\n\nDistillation is the lever. Yours, mine, the open-source community's. Pay attention to this. Fight when it's time to fight.\n\n## Wrapping the series\n\nThis is the last post of the five-part engineering series the manifesto promised. The arc:\n\n1. **The Autonomous Development Manifesto**: the printing press analogy and why this matters now.\n2. **Context Engineering**: the discipline that determines whether your agents work.\n3. **Building a Self-Healing Bug Bot**: the architecture of a real autonomous cluster.\n4. **The Bash Harness**: the manager that supervises the brilliant intern.\n5. **Distillation Is Your Moat**: the long game.\n\nIf you've read all of them, you understand the technology and the strategy of autonomous coding as well as anyone working in the field today. There is no secret you're missing. The work is in the doing.\n\nIf you want a head start (a cluster already tuned, an autonomous engineering team you can describe a SaaS to and watch ship it, a stack you can eject from any time you want) [join the Alchemist waitlist](/#waitlist). We've spent a year and six figures of compute building this. We're packaging it for you.\n\nEither way: stop being a spectator. The next three years are the formative ones for AI-native software, and the people building right now are the ones who will be telling the story.\n\nGet in the arena.",
      "date_published": "2026-05-12T00:00:00.000Z",
      "authors": [
        {
          "name": "Hunter Hodnett"
        }
      ],
      "tags": [
        "distillation",
        "fine-tuning",
        "open-source-models",
        "platform-risk"
      ]
    },
    {
      "id": "https://adaas.dev/blog/the-bash-harness",
      "url": "https://adaas.dev/blog/the-bash-harness",
      "title": "The Bash Harness: Determinism Wraps Non-Determinism",
      "summary": "An autonomous Claude session without a wrapper is a brilliant intern with no manager, no deadline, and an unlimited API budget. The bash harness is the manager. Here's what it enforces, why it has to be deterministic, and the specific failure modes it prevents.",
      "content_text": "There's a moment in every autonomous-coding-cluster project where the engineer realizes their problem isn't the model. The model is fine. The model would happily spend twelve hours and $400 trying every possible solution to a misconfigured environment variable. The problem is the *manager*.\n\nClaude on its own (even Claude 4.6, even Claude with full tool use, even Claude with MCP and a perfect `CLAUDE.md`) has no concept of \"this task should have been done an hour ago\" or \"I've made the same mistake three times in a row.\" It doesn't quit. It doesn't get bored. It will burn your entire token budget if you let it.\n\nThe bash harness is the part of the cluster nobody writes blog posts about, because bash isn't sexy. It's also the part that makes the whole thing reliable enough to push to production. This is that post.\n\n## The framing\n\nCline put it best, in a documentation update I read after building most of our harness: *\"An autonomous agent without a bash harness is a brilliant intern with no manager, no deadline, and an unlimited API budget.\"*\n\nThat's the threat model. The model is the intern. The bash harness is the manager.\n\nThe intern is genuinely good. The intern can do work that would take a human three days. The intern can't *self-supervise*. If the intern wanders into a corner of your codebase that has no test coverage and decides to refactor it from scratch, the intern won't notice that's not what you asked for. If the intern can't figure out how to fix a bug, the intern won't say \"I give up.\" They'll keep trying, with progressively more elaborate workarounds, until something stops them.\n\nThe bash harness is what stops them. Specifically, it does six things.\n\n## What the harness does\n\n### 1. It runs Claude headlessly\n\nThe headline command is `claude -p \"your prompt here\"`. The `-p` flag puts Claude Code in non-interactive (headless) mode. 
Instead of opening a chat UI, Claude runs the prompt to completion, streams its work to standard output, and exits when done.\n\nWe pair `-p` with `--dangerously-skip-permissions`, which tells Claude to stop asking for confirmation before every tool call. In an interactive session, you want those prompts. In a headless session that needs to run unattended, you don't. (The `--dangerously-` prefix is honest. Skip permissions and the agent can do real damage. The bash harness is what makes this safe.)\n\nThe combination (headless mode plus skip permissions) is the atomic primitive of every autonomous coding system. If you only learn one thing from this whole series, learn that. Without it, the model is a chatbot. With it, the model is a worker.\n\n### 2. It enforces a wall-clock timeout\n\nIf Claude has been running for two hours, something is wrong. Either it hit a class of problem it can't solve, or it's looping, or the task was wildly under-scoped. Either way, killing it is the right move.\n\nOur harness wraps `claude -p` in a `timeout 7200` call (2 hours, expressed in seconds). When the timer hits, the process is killed, the work-tree is captured for forensics, and the harness logs a failure ticket for human review.\n\nThe number isn't sacred. We started at 8 hours, dropped to 4 when we realized we were paying for runaway sessions, and settled at 2 because every legitimate task we've measured fits well under that. Pick a number that's longer than your slowest legitimate task and shorter than \"I will go bankrupt.\"\n\n### 3. It detects hangs\n\nClaude can be running but not making progress. Sometimes it gets stuck on a tool call: a Docker container that takes forever to start, an MCP server that hangs, an HTTP request to a slow third party. The wall-clock timeout catches this eventually, but you don't want to pay for two hours of nothing.\n\nThe harness watches the agent's stdout for *tool call activity*. Every meaningful turn produces a tool call entry. 
If five minutes pass with no tool call output, the harness assumes the agent is stuck and kills it. The threshold is a tunable. We started at 2 minutes and bumped to 5 because some legitimate operations (large grep on a big codebase, slow MCP servers) take longer.\n\n### 4. It bans dangerous patterns\n\nThe agent will, occasionally, output a command that would do irreversible damage. Not from malice. From training data. Real examples we've caught:\n\n- `git reset --hard HEAD~5` (when the agent gives up and tries to wipe its work)\n- `git push --no-verify` (when CI fails and the agent decides to bypass the check)\n- `rm -rf /` (rare, but yes)\n- `--force-overwrite` flags on database migrations\n\nThe harness greps the agent's command stream for known-bad patterns. When it sees one, it kills the session and logs a ticket. We curate this list as we discover new ones.\n\nThis is defense-in-depth. The model *shouldn't* output these commands; we have rules in `CLAUDE.md` against most of them. But the model occasionally will, and \"shouldn't\" is not the same as \"won't.\" The harness is the airbag.\n\n### 5. It cleans up after the agent\n\nEach agent session runs in its own *git work-tree*: a separate checkout of the repo, isolated from every other agent's work. When a session ends, the work-tree might contain uncommitted changes, half-finished branches, leftover processes (dev servers, browser instances, MCP servers).\n\nThe harness runs a cleanup pass. It checks the work-tree for uncommitted changes (and either commits them with a salvage marker or stashes them for forensics, depending on session outcome). It runs `pkill` on any orphaned processes that started during the session. It deletes the work-tree if the session succeeded, or archives it if the session failed.\n\nWithout this, every failed session leaves debris. After a hundred sessions, your `~/code/worktrees` directory is a graveyard. Cleanup is unglamorous and load-bearing.\n\n### 6. 
It enforces the push step\n\nWhen a session succeeds, the agent should push its commit to a target branch. Sometimes the agent forgets, or mis-commits, or commits without pushing. The harness checks the work-tree state at the end of every session: are there committed-but-unpushed changes? If yes, the harness pushes them. Are there changes the agent forgot to commit? If yes, log a failure. The agent didn't finish.\n\nThis is the final guardrail. The whole pipeline ends with code in production. If the agent's work doesn't end up there, the entire run was wasted. The harness *makes sure* the work lands.\n\n## Why bash, specifically\n\nI wrote the first version of the harness in TypeScript. It was elegant. It had types. It also took ten times as long to debug as the bash version that replaced it.\n\nBash is the right language for this. Three reasons:\n\n1. **The harness wraps a Unix process, and bash is the language for wrapping Unix processes.** Subprocess management, signals, file descriptors, timeouts, pipes. All of this is expressed concisely in bash and verbosely in everything else.\n2. **The harness has to be debuggable in production.** When something goes wrong, you SSH to the machine and read the script. Bash scripts are visible. There's no compilation step, no module resolution, no runtime to set up. You can audit the entire harness in fifteen minutes.\n3. **The harness should be boring.** It's the deterministic core of an otherwise non-deterministic system. The fewer abstractions, the fewer dependencies, the better. Bash has no dependencies. It will work on any Linux box, with no setup, in five years.\n\nThe harness is currently 312 lines of bash. Most of those lines are comments. The actual logic is short.\n\n## Each agent gets its own work-tree\n\nA note on parallelism, because it's the part that surprises people.\n\nWe run eight agents at a time on my laptop. They don't share a working directory. They don't share a database. 
They don't share dev server ports. Each agent has:\n\n- Its own git work-tree (a checkout of the repo at a separate filesystem path).\n- Its own dev server, on a unique port (5180-5187, parameterized by worker slot).\n- Its own browser MCP target (Chromium instance, also on a unique port).\n- Its own log stream, prefixed with the worker slot for easy filtering.\n\nThe agents are completely isolated. Two agents working on conflicting changes won't step on each other. They're in different work-trees, on different branches. When they push, the push happens on the same remote, and any merge conflicts are resolved by the deploy pipeline (or, in the rare case where they're not auto-resolvable, surfaced as a failure ticket).\n\nThis is the part of the architecture I was most nervous about and it's worked the best. Eight independent workers running in parallel turns out to be the right level of concurrency for one developer's machine. Going higher saturates my Mac's resources. Going lower means tickets sit in queue.\n\n## The lesson\n\nEvery autonomous coding system you build will fail in ways the model can't recover from. The model is good at writing code. It's bad at noticing it's wasted three hours, bad at stopping itself, bad at cleaning up its mess. The bash harness is the part of the system that's good at those things.\n\nWithout it, you have a hobby project. With it, you have something you can leave running overnight.\n\nThis is the unglamorous core of every production-grade agentic system I've seen. Cursor has one (their internal harness for background agents). Devin has one. We have one. If you're building your own cluster, you'll build one too. Mine will save you a few weeks of mistakes.\n\nIn the final post of this series, I'll get into distillation, the part that turns short-term Anthropic-dependence into long-term independence, and the part the AI labs are quietly going to lobby against. 
If you've followed this series this far, distillation is the post that matters most for your business in 2027.\n\nIf you'd rather skip the year of building and use a cluster that already has all of this, [join the Alchemist waitlist](/#waitlist).",
      "date_published": "2026-05-12T00:00:00.000Z",
      "authors": [
        {
          "name": "Hunter Hodnett"
        }
      ],
      "tags": [
        "bash-harness",
        "autonomous-agents",
        "claude-code",
        "production-engineering"
      ]
    },
    {
      "id": "https://adaas.dev/blog/opinionated-stack-deno-svelte-cloudflare",
      "url": "https://adaas.dev/blog/opinionated-stack-deno-svelte-cloudflare",
      "title": "The Opinionated Stack: Why Autonomous Coding Needs Deno, Svelte, and Cloudflare",
      "summary": "Most autonomous-coding platforms try to support every language, every framework, every host. That's exactly why they don't work. We picked Deno, Svelte, and Cloudflare, locked the choices, and got autonomy at a scale that makes outside observers incredulous. The constraint is the feature.",
      "content_text": "The most common question I get from other founders building autonomous-coding platforms is some variant of, \"How do you support so many customer apps with so few engineers?\"\n\nThe answer they expect is a model trick, or a prompt trick, or some clever new agent framework. The real answer is the one nobody wants to hear: we picked one language, one frontend framework, and one cloud, and we refused to support anything else.\n\nDeno. Svelte. Cloudflare.\n\nThat's the whole stack. Every customer app we host runs on those three. Every agent we ship is written for those three. We don't take customers whose apps run on something else. We don't make the platform \"flexible enough to support React or Vue or Solid.\" We don't expose a hosting plugin layer. We took the three choices that compose best for an agent-driven world and we welded them in place.\n\nThat decision is the single highest-leverage call we've made at Chipp. It's also the call that most of our competitors keep refusing to make, and it's why their autonomy numbers stall out where ours keep climbing.\n\nThis post is the argument for why.\n\n## The thesis in one line\n\n**An autonomous coding platform's effective intelligence is inversely proportional to the size of its decision surface.**\n\nEvery place where the agent could plausibly pick option A versus option B versus option C is a place the agent will, eventually, pick wrong. Every framework choice, every cloud primitive, every package manager, every test runner, every router convention. Each one multiplies the number of paths the agent has to navigate. The model is good. It's not \"infinite combinatorial decision tree\" good.\n\nThe only known way to make an agent's behavior reliable at scale is to *remove decisions from the surface*. Not by adding rules on top of a sprawling toolchain, but by collapsing the toolchain itself. 
The autonomous coding platforms that ship working features today are the ones with the smallest possible \"what shall I pick\" surface. The ones that don't ship are the ones still proudly listing thirty supported frameworks on their landing page.\n\nConstraint is not a tax. Constraint is the feature.\n\n## What \"one size fits all\" actually costs\n\nLet me describe the platform shape I keep seeing pitched, because it sounds reasonable on a slide.\n\nThe pitch: a hosted agent platform. The customer brings their existing codebase. Python or Node or Go on the backend. React or Vue or Svelte on the frontend. Deployed to AWS or GCP or Vercel or Fly. The agent reads the customer's repo, figures out the conventions, writes a feature, runs the tests, opens a PR.\n\nIt demos well. It always demos well. The first ticket lands. The team congratulates itself. Then the second ticket comes in, on a different customer with a different stack, and the agent's accumulated context from the first customer is *useless*. Then the third ticket lands, with yet another framework combination, and the agent has to relearn everything again. Then the scar-tissue document you'd been building in `CLAUDE.md` doesn't transfer, because half its rules are about library quirks that don't apply to this customer's stack. ([Why scar tissue in `CLAUDE.md` is the highest-leverage file in your codebase →](/blog/claude-md-architecture))\n\nYou spend the next year of company-building writing per-stack adapters. Each adapter is fine on its own. The compound is fatal. Every adapter is a new place to debug, a new set of dependency versions to track, a new permutation of failure modes for the agent to discover. The model never gets to compound. The platform never gets to compound. 
You're maintaining N agents pretending to be one agent, and each of them is mediocre.\n\nThe teams running this play out loud right now, you know who you are, will hit a wall where their agents work on toy customer apps and fail on real ones. They will conclude that the model isn't smart enough yet. The model is plenty smart enough. The platform forced the model to do combinatorial work that doesn't need to exist.\n\nThe exit from this trap is to *narrow the surface*. Pick the stack. Mean it. Refuse anything else. The customers you lose by being opinionated are smaller than the customers you'll lose by being a worse product.\n\n## Why Deno is the right language\n\nPick one server-side language. Once you've picked it, every reason to pick anything else for that tier collapses. The agent only ever has to know one ecosystem. The skills compound. The scar tissue compounds. The platform compounds.\n\nWe picked Deno. Here's why it dominates Node and Python and Go for this specific job.\n\n**One binary, one toolchain, one mental model.** Deno is a single executable that includes the runtime, the package manager, the formatter, the linter, the test runner, the type checker, and the bundler. There is no `npm` versus `pnpm` versus `yarn` versus `bun` decision. There is no `eslint` versus `biome` versus `oxlint` decision. There is no `prettier` versus `dprint` decision. There is no `jest` versus `vitest` versus `mocha` versus `node:test` decision. The agent runs `deno fmt`. The agent runs `deno lint`. The agent runs `deno test`. There is exactly one right answer to every \"how do I X\" question. The decision tree the agent has to navigate when working in a Deno repo is *embarrassingly* small compared to a Node repo, and that's the entire point.\n\n**TypeScript is the default, not the bolt-on.** No `tsconfig.json` archaeology. No \"did someone forget to add `@types/node`.\" No transpile pipeline. You write TypeScript, Deno runs TypeScript. 
The agent writes TypeScript, the agent runs TypeScript. Type errors surface immediately. The whole class of \"the agent generated valid JavaScript that's invalid in this project's TS config\" failure mode is gone.\n\n**Permissions are first-class.** `deno run --allow-net=api.stripe.com --allow-env=STRIPE_KEY ...` is the default shape of a Deno program. The agent's generated code declares what it touches. When the agent goes off the rails and tries to read `/etc/passwd`, the runtime refuses. The model's worst impulses get sandboxed for free. Compare this to Python or Node, where every script has the same authority as the developer running it.\n\n**The standard library is real.** `@std/http`, `@std/path`, `@std/encoding`, `@std/testing`, `@std/assert`. Things that should not require a package, do not require a package. The set of things the agent might `npm install` to solve a routine task shrinks. Fewer dependencies = fewer choices = less hallucination surface. The agent reaches for `@std/path` because the agent only knows about `@std/path`, because that's what the CLAUDE.md says to use, and that's what's already in the workspace.\n\n**URL imports are reproducible.** `import { foo } from \"jsr:@scope/package@1.2.3\"` is the import statement *and* the dependency declaration. There is no separate `package.json` to keep in sync. There is no `node_modules/` to corrupt. There is no lockfile divergence between the agent's working tree and the deploy. The agent that fixes a bug in a function and pushes it is *touching the same artifact* the runtime executes. This sounds boring until you've debugged the third \"but it worked on my laptop\" failure caused by a transitive dependency drifting in the lockfile.\n\nThere's a meta-property here that I think is underrated: **Deno is opinionated about the same things autonomous agents are bad at.** Toolchain selection. Configuration sprawl. Implicit globals. Hidden permissions. 
Every place Deno's designers said \"we're picking one way\" is a place we don't have to write a CLAUDE.md rule explaining which of the seventeen Node-flavored ways our codebase prefers. The framework already enforces the convention. The agent just has to follow the only available path.\n\n> \"Every language choice an agent can make is a choice the agent will eventually get wrong. The win is removing the choice, not training the agent to choose better.\"\n> — Hunter Hodnett, Chipp CTPO\n\nCould we have picked Go? Go has some of the same properties, single binary, single tool, opinionated formatter. We didn't, because the agent has to write *both* server code and shared code with the frontend. Sharing types across a Deno backend and a Svelte frontend is the same TypeScript file. Sharing types across a Go backend and a TypeScript frontend is a code generator and a build step. The friction of crossing the language boundary, multiplied by the number of features the agent ships per day, dwarfs the marginal Go-versus-TS productivity gap. One language, one type system, one mental model.\n\nCould we have picked Bun? Bun is closer to Deno than to Node, and on the runtime axis it's competitive. We didn't, because Bun's posture is \"drop-in Node replacement,\" which means it inherits Node's combinatorial config surface as a feature. Deno's posture is \"we picked the answers.\" We wanted the picks.\n\n## Why Svelte is the right frontend framework\n\nPick one frontend framework. Same logic. Same compounding.\n\nWe picked Svelte (Svelte 5, with runes). Here's why it dominates React for an agent-driven codebase.\n\n**Less code per feature.** This is the boring, load-bearing fact. A typical Svelte component that does what a React component does is somewhere between half and a third the line count. For a human reader, that's a quality-of-life thing. For an agent with a finite context window, it's *capacity*. 
Every additional line of code the agent has to load into context to reason about the surrounding feature is a line that crowds out something else. Svelte's compiler does the work React's hooks-and-providers ritual makes the developer do explicitly. The shorter source is the source the agent can hold in its head.\n\n**The compiler is the framework.** React ships a runtime that interprets your component tree at every render. Svelte compiles your component down to direct DOM mutations at build time. The agent doesn't have to reason about render orderings, dependency arrays, stale closures, the rules of hooks, when to memoize, when not to memoize, the seven flavors of effect, server components versus client components, suspense boundaries, or any of the other lore that has accumulated on top of React over a decade. The agent reasons about HTML, CSS, and a small number of runes. That's it.\n\n**Runes are a small, consistent surface.** `$state`, `$derived`, `$effect`, `$props`, `$bindable`. Five symbols. Each one does exactly one thing. There is no `useState` versus `useReducer` versus `useRef` versus `useImperativeHandle` decision. There is no \"is this a controlled component\" lore. The agent picks the rune by what it's doing, not by which of the eight historical APIs ended up at the right level of abstraction.\n\n**SvelteKit's file system is the routing.** `routes/foo/+page.svelte` is the foo page. `routes/foo/+page.server.ts` is the foo page's server load. `routes/foo/+server.ts` is the foo API endpoint. The agent doesn't pick between Pages Router and App Router. It doesn't pick between server components and client components. It doesn't pick between `getServerSideProps` and `getStaticProps` and `loader()`. The filename declares the role. 
There is one way.\n\n**No state management library debate.** React projects routinely include three different state libraries (Redux, Zustand, React Query, Jotai, MobX, take your pick) and the agent has to learn which one this codebase uses for what. Svelte projects use runes. State that's per-component lives in `$state`. State that crosses components lives in a `.svelte.ts` file with `$state` at module scope. The store debate that's eaten a decade of React community discourse simply does not exist.\n\nCould we have picked Solid? Solid is closer to Svelte than to React on every axis I care about. We didn't, because Solid is small enough that the customer-facing component ecosystem is thin. Svelte has the second-largest component ecosystem after React, which matters when the agent reaches for \"I need a date picker.\" Solid would have been the right pick if we were optimizing for performance benchmarks. We were optimizing for the agent's daily reach.\n\nCould we have picked React (Next, Remix, plain Vite)? No, for the same reason we didn't pick Bun: React's posture is \"everything is a choice, pick wisely.\" That is exactly the wrong posture for an agent. We don't want our agents picking wisely between four router libraries. We want the agent to not have the choice in the first place.\n\nThe component-count tax this saves us is real. A Chipp customer's average feature ticket, when shipped on a React stack, was around 3.2x the diff size of the same feature shipped on Svelte. 3.2x more code is 3.2x more places to put a bug, 3.2x more tokens to load on review, 3.2x more `CLAUDE.md` rules to write about which React patterns we use. Multiplied across the volume of features we ship per customer per week, the framework choice was the second-largest determinant of how much autonomy we could squeeze out of the platform. (The first was the model, and we don't control the model.)\n\n## Why Cloudflare is the right cloud\n\nPick one cloud. Same logic. 
Same compounding.\n\nWe picked Cloudflare. Workers for compute, R2 for blobs, KV for caches, D1 for relational data, Durable Objects for stateful coordination, Queues for async work, Pages for static assets, Custom Hostnames for per-customer domains, Zero Trust for auth gating. One platform, one CLI (`wrangler`), one mental model.\n\nHere's why it dominates AWS and GCP for an agent-driven backend.\n\n**One control plane, one API.** When an agent needs to provision a queue, it calls one wrangler command and an HTTP endpoint. It doesn't have to pick between SQS, SNS, EventBridge, Kinesis, MSK, and AppFlow. It doesn't have to know which of those has the right semantics for this customer's job shape. The agent has Queues. The customer has Queues. The decision tree has one branch.\n\n**The primitives are general-purpose.** Workers + R2 + D1 covers the shape of every customer app we've shipped. The agent learns those three. The agent ships features. The agent doesn't go on a multi-day expedition learning that the customer's logging system happens to require IAM role A to assume IAM role B to publish to topic C subscribed by Lambda D writing to S3 bucket E. That sentence is a parody of cloud onboarding, except it's also literally what we paid the AWS tax for in a previous life.\n\n**Edge-first is deterministic for the agent.** Workers run at the edge, the same way everywhere. There is no `us-east-1` versus `us-west-2` decision. There is no \"did we forget to deploy this to all regions\" failure mode. When the agent deploys, the deploy is global. When the agent verifies, the agent's verification reflects what the customer's customer will see, anywhere on earth. The \"works in dev, fails in prod\" gap, which is mostly a region/availability/IAM gap in AWS-shaped systems, doesn't exist.\n\n**Wrangler is the agent's hands.** Every Cloudflare resource is reachable from `wrangler`, and `wrangler` runs in a shell, and the agent has a shell. 
We don't need a custom orchestration layer to give the agent provisioning capability. The agent writes `wrangler.toml`, the agent runs `wrangler deploy`, the agent reads the output. The same tool a human operator would use is the tool the agent uses, which means the agent's mistakes are debuggable by the same workflow that debugs a human's mistakes. ([Why the bash harness is the right abstraction →](/blog/the-bash-harness))\n\n**Workers are fast to deploy and isolated by default.** A `wrangler deploy` is typically completed by the time the agent's next tool call fires. That tight loop is what makes autonomous verification practical. The agent ships, the agent curls the deployed endpoint, the agent observes the response, the agent decides. If the deploy round-trip took the length of a coffee break, autonomous verification would simply not be a workable shape. Cloudflare's deploy primitives are *agent-paced*, which is a phrase I never thought I'd need to coin but which now describes the most important property a cloud can have for this kind of work.\n\n**Per-customer isolation comes for free.** Each customer's deploy is a separate Worker. Each customer's data is a separate D1. Each customer's blobs are a separate R2 prefix. Cross-tenant leakage is a tenancy-shape problem, not an IAM-policy problem. We don't write tenant-isolation middleware. We don't write tenant-isolation tests. The platform's primitive is the boundary.\n\nCould we have picked Fly? Fly is close to Cloudflare in posture (opinionated, edge-first, small surface). We didn't, because Fly is positioned as \"run your container anywhere,\" which puts the container-image debate back on the table. We didn't want our agent making Dockerfile decisions. Cloudflare Workers don't have a container concept at all. The agent writes TypeScript, the agent deploys TypeScript, the platform handles the rest. The whole containerization layer is a decision surface we eliminated.\n\nCould we have picked Vercel? 
Vercel is great at the React-flavored slice of the world, but its primitives outside that slice (blob storage, queues, durable state) are thinner and lean on third-party adapters. We didn't want our agent gluing together Vercel-plus-Upstash-plus-Neon-plus-Inngest, each with its own API and its own dashboard. Cloudflare's one-platform claim is the closest a cloud has come to \"one mental model for everything the agent needs.\"\n\nCould we have picked AWS or GCP? No. The hyperscalers are *the canonical bad shape* for autonomous coding. Three thousand services, fifteen ways to do every job, IAM policies that take human engineers years to internalize. We are not going to outwit AWS's combinatorial sprawl by being clever about prompts. We are going to avoid it by not standing on it.\n\n## What you give up\n\nI want to be honest about what this decision costs, because the \"constraint is the feature\" pitch glosses over real tradeoffs.\n\n**You lose customers who already have a stack.** A customer running a Rails monolith on Heroku is not our customer. We can't agent-ize their app. We don't try. The customers we sign are the customers who haven't committed to a stack yet, or who explicitly want to greenfield on ours because the autonomy multiplier is worth the migration.\n\n**You lose engineers who want to pick.** Some engineers want a platform that respects their craft, which often means respecting their existing toolchain choices. Our pitch is the opposite. The platform makes the choices. The engineer's craft is in the domain, not the stack. Some prospective hires bounce off this. The ones who stay are the ones who internalized that the stack stopped being a competitive advantage years ago.\n\n**You take on the bet that the chosen tools stay good.** If Deno collapses, we have a migration. If Cloudflare massively raises prices, we have a migration. If Svelte 6 makes runes obsolete in a way we can't follow, we have a migration. 
The constraint binds us as much as it binds the customer. We accept that bet because the alternative, supporting twelve stacks badly forever, is strictly worse and we can see the future where that company looks like a museum.\n\n**You can't be everything to everyone.** Good. The platforms that are everything to everyone are the ones whose autonomous-coding numbers stop at \"demo working\" and never reach \"running production for two months unattended.\" We'd rather be one thing to a smaller everyone, work well, and let the autonomy compound.\n\n## What this looks like in practice\n\nA customer-template repo with a `CLAUDE.md` that knows it's a Deno + Svelte + Cloudflare project. The scar-tissue rules are about Svelte 5 runes, Deno 2 APIs, wrangler conventions. There is no rule that says \"if this is a React project, do X; if this is a Vue project, do Y.\" The agent never has to branch on stack identity. The rules apply directly.\n\nA reproducible E2B sandbox with Deno installed, wrangler installed, Svelte's tooling installed, Chrome installed for browser-driven verification. Every agent run starts from the same base image. The agent never spends a tool call figuring out whether `pnpm` or `npm` is the right installer here, because the answer is always `deno` and the answer is in the base image. ([How we built that sandbox →](/blog/agentic-design-patterns))\n\nA deploy pipeline that's two commands: `wrangler deploy` and curl the result. The whole \"did the deploy land\" round-trip is short enough that the agent doesn't need a separate state machine to model it. The agent ships, the agent verifies, the agent moves on. The model's working memory is enough; we never had to build a \"track in-flight deployments\" subsystem because the deployment never has time to be in flight.\n\nA per-customer Worker that isolates everything. When the agent screws up, the blast radius is one Worker. When the agent ships a fix, the rollout is global and atomic. 
There is no rolling-deploy interlude where some customers are on the old code and some on the new. The agent doesn't have to reason about partial-rollout failure modes because partial-rollout doesn't exist.\n\nA frontend that ships fewer lines of code per feature, which means the review pass loads less context, which means the agent's `CLAUDE.md` review rules apply to a smaller surface, which means the same fixed budget of review attention catches more bugs.\n\nThese properties are not independent. They compose. Each constraint we accepted lets the others compose harder. The reason our autonomy numbers look the way they do is that the platform itself was designed to make autonomy the default path, and the platform's design is the stack choice.\n\n## The general principle\n\nIf you're building an autonomous coding platform, the most important spec is *not* your agent harness. It's not your model picker. It's not your prompt library. It's *the platform your agent ships code for*.\n\nThe agent is the smallest part of the system. The platform is the part you can shape. Shape it the way the agent's failures want you to.\n\nSpecifically:\n\n- **Pick one language.** The language with the best built-in tooling, the smallest config surface, and the strongest cross-tier story with your frontend.\n- **Pick one frontend framework.** The framework with the smallest source-code footprint per feature, the most opinionated conventions, and the smallest \"lore surface\" the agent has to learn.\n- **Pick one cloud.** The cloud with the smallest set of general-purpose primitives, the tightest deploy round-trip, and the most isolated per-tenant shape.\n\nThen refuse to compromise on any of those three. Refuse politely. Refuse with a smile. Refuse the customer who insists on Python. Refuse the engineer who insists on Next. Refuse the investor who tells you you'd be \"more flexible if you supported AWS.\" Flexibility is what the platforms that don't ship have. 
We picked the platforms that ship.\n\nWhen someone tells you the model isn't smart enough to be autonomous on their stack yet, take a look at their stack. Most of the time, the model is plenty smart. The stack is the bottleneck. The decision surface is too wide. The combinatorial cost of every \"which library\" question is eating the model's working memory. Shrink the surface and the same model gets dramatically more autonomous overnight.\n\nThis is the lesson I wish more of the autonomous-coding ecosystem would internalize. We are not in the era of \"the model will figure it out.\" We are in the era of \"the platform will let the model figure it out, or it won't.\" The platforms that pick will win. The platforms that don't will spend the next year of company-building writing per-stack adapters and wondering why their demos plateau.\n\nDeno. Svelte. Cloudflare. Three picks. One stack. Autonomy compounds.\n\nThat's the post. The follow-up is the platform itself, which is what we ship on every weekday at Chipp, and what the [autonomous development manifesto](/blog/the-autonomous-development-manifesto) lays out in full.\n\nIf you're building toward the same shape, I'd love to compare notes. If you're building toward the everything-to-everyone shape, I'd love to come back in a year and compare numbers.",
      "date_published": "2026-05-10T00:00:00.000Z",
      "authors": [
        {
          "name": "Hunter Hodnett"
        }
      ],
      "tags": [
        "autonomous-development",
        "deno",
        "svelte",
        "cloudflare",
        "stack-design",
        "opinionated-platforms"
      ]
    },
    {
      "id": "https://adaas.dev/blog/beyond-ai-pair-programming",
      "url": "https://adaas.dev/blog/beyond-ai-pair-programming",
      "title": "Beyond AI Pair Programming: From Copilot to Coworker to Autonomous Engineer",
      "summary": "Copilot was the start, not the destination. Three roles for AI in your engineering org, pair programmer, coworker, autonomous engineer, and the org-chart implications when you move from one to the next. Plus how to upgrade your own job description through the curve.",
      "content_text": "GitHub Copilot launched in October 2021. The pitch was *AI pair programmer*. A model that finishes your line, suggests the next function, helps you remember the syntax for that thing you keep looking up.\n\nThe pitch was honest. The category name was ambitious. Pair programming, as practiced by humans, is two people taking turns at one keyboard. One drives, one navigates, both think. Copilot was *not* that. It was a smarter autocomplete with a better marketing department.\n\nThe honest pitch, *autocomplete that actually understands your codebase*, would have been more accurate but less aspirational. Marketing won. The \"AI pair programmer\" framing stuck. And four years later, the framing is the thing holding most builders back from seeing what AI in engineering can actually do.\n\nThis post is about the upgrade path. Three roles for AI in your engineering org, pair programmer, coworker, autonomous engineer, what each one actually means, and the org-chart implications when you move between them.\n\n## Role 1: AI as pair programmer\n\nThis is where Copilot started, where Cursor lives, where Cody and Tabnine and Continue all sit.\n\nThe model finishes your line. The model writes a function when you describe one in a comment. The model fixes a syntax error you didn't notice. The model is *latency-bound*, it has to be fast enough to keep up with your typing, which means it has to be small enough to run in real time, which means it can't reason very deeply about what you're doing.\n\nThe output of an AI-pair-programmer setup looks like a slightly faster human engineer. You ship 10–30% more code per hour than you would have without it. The shape of your day is unchanged. You're still typing all day. You're still reviewing every line. You're still the one who decides what file to open next and what to do once you're there.\n\nThis is a real productivity gain. It's not a category change.\n\nWhat you give up to operate at this level: nothing. 
Adoption is friction-free. Every engineer on your team can be using a pair programmer by tomorrow afternoon and there are no architectural changes required.\n\nWhat you fail to gain by stopping here: everything below.\n\n## Role 2: AI as coworker\n\nThe coworker shows up around 2024 with the agent capabilities baked into Claude Code, Cursor's agent mode, the Codex CLI, and a few others.\n\nThe shape of the work changes. Instead of the model finishing your line, the model takes a goal and *runs a session* against the goal. You say \"investigate why our billing endpoint is returning 500s\" and the model spends ten minutes reading files, querying logs, forming a hypothesis, writing a fix, running tests, and telling you what it did.\n\nYou're still in the loop. You watch. You course-correct. You decide when the session is done. But you're not at the keyboard the whole time anymore. The model is operating in *minutes-long blocks* of focus, and you're operating in the spaces between.\n\nThe output looks like having a junior engineer at the desk next to you. You hand them tickets. They take a stab. You review. The throughput per ticket goes way up: junior engineers are bottlenecked by their own typing speed; the agent isn't. The throughput of *your* time goes up because you're not typing during the agent's blocks; you're handling other things.\n\nWhat you give up to operate at this level: trust. You have to let the agent run for ten minutes without watching every keystroke. The first time it takes a wrong turn and burns ten minutes of context, you'll want to go back to the pair-programmer model. The teams that hold the trust line through the first month make it past this. The teams that don't end up reverting.\n\nWhat you fail to gain by stopping here: scale. The coworker is still bounded by your attention. You can run *one* coworker session at a time because you have to be available to course-correct it. 
You can't sleep through the night because you have to be available to review what the coworker shipped during the day.\n\nThis is where most ambitious teams sit in mid-2026. It's a good place. It's also not the destination.\n\n## Role 3: AI as autonomous engineer\n\nThe autonomous engineer is the role most builders haven't seen yet, because it requires a different *organizational* setup, not just a different tool.\n\nThe model takes a goal. It runs a session. It verifies its own work, opens a browser, takes a screenshot, reads logs, fixes the things it broke. It commits. It pushes. It deploys. There is no human at any step.\n\nMultiple autonomous engineers run in parallel. We have eight workers running on a single workstation at Chipp. Each one is its own Claude Code instance, in its own git worktree, with its own dev server on its own port, fielding tickets from a shared queue.\n\nThe output isn't *more code per hour*. The output is *more shipped product per dollar of payroll*. We ship 20–30 production changes a day on a two-person engineering team. The PR queue is empty because there are no PRs. The on-call rotation is empty because the cluster handles its own incidents.\n\nWhat you give up to operate at this level: your PR review process, your habit of reading every diff, the comfort of knowing exactly what's in production. ([The full list of trade-offs →](/blog/vibe-coding-vs-autonomous-development#what-you-give-up))\n\nWhat you fail to gain by stopping at the coworker level: the ability to run while you sleep. The ability to compete on engineering capacity with teams an order of magnitude larger than yours. The ability to ship features that wouldn't have survived a normal cost-benefit analysis.\n\n> \"The AI wasn't an assistant. It was the engineer. 
I was the manager.\"\n> — Hunter Hodnett, Chipp CTPO\n\n## The org chart at each level\n\nHere's where it stops being about tools and starts being about people.\n\n### Pair-programmer org chart\n\nThe org looks like a normal engineering team. Engineers report to engineering managers. Engineers write code. EMs run people. There are PRs. There are reviews. There are sprint planning meetings.\n\nThe shape doesn't change because the work doesn't change. The engineers are 10–30% faster but they're still doing what they were doing.\n\n### Coworker org chart\n\nThe shape starts to shift. Engineers spend less time typing and more time *steering*, defining tickets, reviewing the agent's work, fixing the parts the agent got wrong. The senior engineer's day starts looking more like a manager's day.\n\nEMs notice that some of their direct reports are shipping at suspicious volumes. The \"10x engineer\" trope, previously folklore, starts to look real.\n\nThis is also where the awkward middle starts. Junior engineers struggle, because the agent is good at the work juniors used to learn from. The \"draw the rest of the owl\" jokes hit different when the agent draws most of the owl and the junior is supposed to learn by drawing it themselves.\n\n### Autonomous-engineer org chart\n\nThe shape is *fundamentally* different. There are no engineers writing code in the way the prior org charts had. There are *engineering managers*, humans who manage clusters of agents. There are *architects*, humans who decompose product intent into ticket-shaped work and review the agent's outcomes. There are *infra people*, humans who maintain the cluster, the harness, the MCP fleet.\n\nHeadcount drops. The remaining roles get *more* important, not less. A two-person autonomous engineering team can deliver what a fifteen-person traditional team could. Each of those two people is doing higher-judgment work than any of the fifteen.\n\nThis is the org chart most companies in 2030 will have. 
Most companies in 2026 do not. The transition is the hard part.\n\n## Upgrading your own role\n\nIf you're a working engineer reading this, the question is what to do about your own job.\n\nI'll give you the honest answer. Your job in the coworker world is what your job in the autonomous-engineer world is going to be: less typing, more decomposition; less review, more architecture; less owning code, more owning the system that produces code.\n\nThree things to start doing now:\n\n### 1. Stop measuring yourself by lines of code\n\nLines per day was a noisy metric before AI; it's worse than noisy now. The agent can produce more lines than you in any given hour. If your self-image is wrapped up in being a productive typist, the next two years are going to feel bad. Decouple now.\n\nThe new metric is *tickets shipped per dollar of token spend*. Or *features merged per week of clock time*. Or *production stability holding while shipping volume goes up*. Find a metric that values judgment over typing.\n\n### 2. Get good at decomposing\n\nThe hardest part of running an autonomous cluster is turning vague intent into pipeline-sized tickets. *\"Build a billing dashboard\"* is not a ticket. *\"Add a usage breakdown chart to the existing billing settings page, sourced from the same data the credit balance uses, with the design pattern from the org members table\"* is a ticket.\n\nDecomposition is a skill. It's the skill of being a good tech lead, but more granular. Start practicing it now even on work you're doing yourself. You'll be doing nothing else in two years.\n\n### 3. Get good at reading agent output fast\n\nIn the autonomous-engineer world, you'll be reviewing more of the agent's work than your own, hundreds of small commits per week, mostly auto-verified, but with a few that need human eyes for judgment calls.\n\nThe skill of speed-reading a diff and forming an opinion in 60 seconds is undervalued today. 
It will be the most valuable engineering skill of 2028.\n\nStart practicing on the agent's output now. Set a timer. Make snap calls. Build the muscle.\n\n## What this means for hiring\n\nIf you run engineering hiring, the implication is that the profile you want is shifting fast.\n\nThe pair-programmer-era hire was a strong typist with good fundamentals, someone who could turn a Jira ticket into shipped code reliably.\n\nThe autonomous-engineer-era hire is a strong *judge* with good fundamentals, someone who can look at an agent's output and tell you whether it's right, wrong, or interesting; someone who can decompose a vague product brief into pipeline-sized tickets; someone who's comfortable trusting a cluster to push to production while they sleep.\n\nThe fundamentals haven't changed. Algorithms, data structures, system design, debugging instincts, those still matter, because you can't judge an agent's output if you don't understand the underlying engineering. What's changed is the layer the human operates at. You're not the engineer anymore. You're the engineering manager, for a team of agents that doesn't go home at night.\n\nHire for that. Pay for that. Promote for that. The teams that figure out the new hiring profile first will end up with the engineers everyone else wants in 2028.\n\n## Which kind of engineer do you want to be in 2027?\n\nThere are three honest options.\n\n**Option 1: Pair-programmer engineer.** Same job you have today, slightly accelerated. Plenty of teams will still operate this way in 2027. They'll be losing market share. You'll be paid like a 2024 engineer with a faster IDE. Defensible if you're senior enough to ride out the contraction.\n\n**Option 2: Coworker-era engineer.** You've absorbed the agent into your daily flow. You're 5–10x more productive than your 2023 self. You're still in the inner loop on every change. This is the median ambitious engineer in 2026. 
It's a defensible position for two more years, then it gets hard.\n\n**Option 3: Autonomous-engineer-era engineer.** You manage agents. You decompose work. You design systems. You judge outcomes. You're the EM of a cluster that ships more product than your old fifteen-person team did. This is the role that compounds.\n\nI know which one I'm betting on. We're building Alchemist for the people making the same bet.\n\n**[Join the Alchemist waitlist →](/#waitlist)**\n\n---\n\nIf you want the long-form case for the destination, read [The Autonomous Development Manifesto](/blog/the-autonomous-development-manifesto).\n\nIf you want the architectural difference between Stage 3 and Stage 5, read [Vibe Coding vs Autonomous Development](/blog/vibe-coding-vs-autonomous-development).\n\nIf you want to see what an autonomous cluster looks like under the hood, read [Building a Self-Healing Bug Bot](/blog/self-healing-bug-bot).",
      "date_published": "2026-05-07T00:00:00.000Z",
      "authors": [
        {
          "name": "Hunter Hodnett"
        }
      ],
      "tags": [
        "ai-pair-programmer",
        "ai-software-engineer",
        "autonomous-software-engineer",
        "org-chart"
      ]
    },
    {
      "id": "https://adaas.dev/blog/the-autonomous-development-manifesto",
      "url": "https://adaas.dev/blog/the-autonomous-development-manifesto",
      "title": "The Autonomous Development Manifesto",
      "summary": "Why we're building Alchemist AI: a working theory of how autonomous coding agents change the unit economics of software, who wins, who's at risk, and what the next three years look like from the inside.",
      "content_text": "The printing press got invented in 1440. Roughly a century later, the entire economic and political structure of Europe was different. Not because the press had changed anything directly — the press just printed paper — but because the act of distributing knowledge had been delaminated from the act of producing it. Once that came apart, every system built on that coupling came apart with it.\n\nSoftware is in 1440 right now.\n\nFor the last fifty years, producing software and distributing it have been welded together. To ship a feature, you needed someone — usually a credentialed someone — to hand-write the lines. To run a software business, you needed a roster of those someones. Every SaaS price tag is, ultimately, a way of paying down the cost of the team that wrote the code.\n\nAutonomous coding agents are pulling those two halves apart. The act of producing software is delaminating from the act of running a software business. That's the thesis, and it's the thesis Alchemist is built on.\n\nI'm going to make the case for what's actually happening, what isn't, and why we're betting our company on it.\n\n## What changed\n\nI've been a software engineer at Reddit, at Amazon Music, at Home Depot. I've been an engineering manager. I've shipped a lot of code by hand. None of that prepared me for the last six months at Chipp.\n\nWe don't write code anymore. Scott — my co-founder — and I have eight Claude Code sessions running in parallel on my laptop right now, and they're shipping features straight to production. Not pull requests. Not \"AI-assisted commits.\" Production. There is no human review gate between the agent and the live system. We deploy 20 to 30 times a day. Two months ago I started sleeping through the night for the first time since we raised our seed round, because the agents — not me — get paged when production breaks, and they fix it before I wake up.\n\nThis isn't a hype reel. It's a normal Tuesday. 
The autonomous cluster has been running in production for two months. It's how we built our agentic commerce protocol implementation. It's how we built our voice-agent stack. It's how our customer support works — when a customer Slacks us a bug, an AI agent reads our codebase, writes the fix, verifies it in a real browser, pushes it to production, and posts the result back in the channel. Median time from \"this is broken\" to \"this is fixed in prod\" is around thirty minutes, and we're trying to get it to ten.\n\nI tell people this and they nod politely and assume I'm exaggerating. I'm not. The reason I'm not is the reason this manifesto exists.\n\n## Why now and not last year\n\nThree things compounded.\n\nFirst, the models got good enough. Claude Opus 4.6 was the inflection point for us. Below that line, the agent would write code that looked plausible but didn't run; it would hallucinate function signatures; it would take twelve turns to do what should have taken two. Above that line, you can hand it a bug report and walk away. Opus 4.6 is the printing press of this analogy — without it, none of the rest matters.\n\nSecond, the *tool surface* got rich enough. Two years ago, an LLM was a chat interface. Now it's a programmable agent that can read files, run shell commands, query a database, drive a browser, hit your production logs, and call any HTTP API on the internet. The Model Context Protocol — Anthropic's USB-for-AI standard — is the boring detail that makes this whole thing work. The model still hallucinates plenty. But now it can *check itself*. It can take a screenshot of the page it just built and notice that the button is the wrong color. It can read its own server logs after the deploy and notice that the error rate spiked. The hallucinations stop costing you when the agent has hands and eyes to verify with.\n\nThird, *context engineering* turned out to be a real skill. 
The first time I tried to get an agent to ship a feature autonomously, it failed. Not because the model was dumb. Because I dumped a 200,000-token codebase into its context and asked it to fix something on line 14,000. We had to figure out — and we did, painfully, over thousands of dollars in token spend — how to load only the relevant context into a finite window, how to summarize without losing the load-bearing details, how to chain agent runs together so the output of one becomes the input of the next without information loss. That's not a model improvement. That's a software engineering discipline. Anyone can learn it. Most people haven't yet.\n\nWhen those three things land together — capable model, rich tool surface, learned discipline — you get autonomous software development. Which is what Alchemist is.\n\n## What this changes\n\nThe unit economics of software flip.\n\nBefore: a feature costs you one engineer-week. Maybe two. The engineer gets paid whether the feature is valuable or not. To run a software business, you stockpile engineer-weeks. That's why software companies look like software companies — most of the budget is people, most of the people are engineers, most of the engineers are working on things you can't sell yet.\n\nAfter: a feature costs you a few dollars in API tokens and the time it takes you to *describe* it. Most of our changes at Chipp cost between $2 and $4 in token spend, end to end — research, implementation, code review, deploy. A senior engineer's hourly rate, by comparison, is somewhere north of $150 an hour fully loaded. The arithmetic is grim if you're a ten-thousand-person engineering org and exhilarating if you're two people trying to ship a venture-backed product.\n\nThis is the part that most public commentary gets wrong. 
The story is not \"AI replaces engineers.\" The story is \"AI replaces the *bottleneck of engineer time*.\" Engineering judgment — what to build, what to skip, what's worth shipping rough versus polishing — does not get cheaper. It might get more valuable. What gets cheap is the labor of typing the code. That used to be the limit. Now it isn't.\n\nA handful of consequences fall out of that.\n\n**Tiny TAMs become buildable.** A \"total addressable market\" too small to justify the engineer-weeks is not too small to justify a weekend. There's exactly one HOA-management SaaS in the country that serves Upstate South Carolina specifically — there isn't, actually, but you see where I'm going. The HOAs hate the national tools, the national tools can't justify customizing for them, and a builder with an autonomous cluster can spin one up in a week that's better, cheaper, and locally serviced. The barrier was always the cost of the build, not the size of the market. The build cost just collapsed.\n\n**SaaS unbundles.** Big SaaS companies are priced to amortize the engineering team that built them. That price is now an arbitrage opportunity. You can build the slice of the SaaS your customer actually uses for one to two orders of magnitude less, charge less than the incumbent, and be the local face the incumbent can never be. We have customers right now selling Chipp-hosted agents at 90% margins on top of us, to enterprises whose previous spend was on roll-your-own engineering teams. None of these customers are software people. They're domain people who learned how to describe a feature.\n\n**Every engineer becomes a manager.** Not in the headcount sense — most companies are about to have *fewer* engineers, not more. In the cognitive sense. The day-to-day work of a senior engineer is becoming what a senior engineer's day used to look like in the rare moments when they had a productive intern. 
You assign tickets, you review the work, you push back on the bad calls, you absorb the merge. The agent does the typing. If you've been a strong manager and a mediocre coder, this is your moment. If you've been a strong coder and a weak manager, you have homework.\n\n## What hasn't changed\n\nI'm going to be honest about the limits, because the maximalist version of this story is wrong and the people selling it are about to get a lot of folks hurt.\n\nAutonomous agents fail. They fail about 20-30% of the time at Chipp, even on tickets we've tuned the system for. When they fail, the failure mode is usually one of three things: the prompt didn't have enough context, the context window overflowed and the agent forgot what it was doing, or the agent hit a class of problem (cross-cutting performance work, ambiguous product calls, anything requiring you to hold the whole system in your head) that the model genuinely cannot do yet.\n\nThe first two are fixable with better engineering. We've been chipping away at them for a year. The third is a model problem, and you're at the mercy of Anthropic's release schedule.\n\nThere's also a platform-risk story that nobody likes to talk about. Right now, Anthropic's models are the best for coding, by a margin. If you build your business on top of Claude — like we have — Anthropic eventually owns your margin. That's the deal you're in. The defense against that is *distillation*: training your own smaller model on the outputs of the frontier model, locking in the parts of your workflow that won't change for a year. We've been distilling for two months. It's the long game, and it's the part of this that the AI labs are going to lobby to make illegal, because it's the only thing that prevents a one-lab monopoly from owning the entire software industry. Watch that fight. 
It's the most consequential AI policy debate of the decade and almost nobody is paying attention.\n\n## What we're building\n\nAlchemist is the cluster I just described, packaged as a product anyone can use.\n\nYou describe what you want. We deploy an autonomous engineering team — research agent, implementation agent, code-review agent, documentation agent, deploy agent — that builds it. The output is a real codebase, in a stack we've spent six figures of token spend tuning (Deno on the server, Svelte on the client, Cloudflare for delivery), running on infrastructure we've made cheap. You can use the platform forever, or you can eject — pull the GitHub repo, take the code, run it yourself, and stop paying us. We optimized for that on purpose. If we ever raise prices into a corner, you walk away and we deserved it.\n\nThat's the bet. The bet is that the same way 1440 didn't end up being about the press itself but about everything the press *uncoupled*, the next decade isn't about coding agents. It's about everything that comes apart once code stops being scarce.\n\nThe companies that are going to be huge in 2030 are the ones building right now, while everyone else is still arguing about whether the agents really work. The agents really work. We've been running them in production for two months. The window where this is a contrarian take is short.\n\nIf you want to be early, [join the waitlist](/#waitlist).\n\nIf you want to understand the engineering — the context-engineering tricks, the self-healing loop, the bash harness, the distillation moat — I'm writing a series of posts that goes deep on each piece. The series picks up where this one ends.\n\nWe'll see you on the other side.\n\n— Hunter Hodnett, co-founder, Alchemist AI",
      "date_published": "2026-05-06T00:00:00.000Z",
      "authors": [
        {
          "name": "Hunter Hodnett"
        }
      ],
      "tags": [
        "autonomous-software-development",
        "manifesto",
        "founder-voice",
        "ai-agents"
      ]
    },
    {
      "id": "https://adaas.dev/blog/vibe-coding-vs-autonomous-development",
      "url": "https://adaas.dev/blog/vibe-coding-vs-autonomous-development",
      "title": "Vibe Coding vs Autonomous Development: The Maturity Curve from Prompt to Production",
      "summary": "Vibe coding is the second-best place to be in 2026. Autonomous development is the best. The two get conflated constantly, they're separated by one hard architectural step. Here's the maturity curve, what each stage actually means, and how to climb from one to the other in 90 days.",
      "content_text": "A friend asked me last month whether vibe coding and autonomous development were the same thing.\n\nI gave him a long answer. He cut me off. *\"Just tell me which one I should be doing.\"*\n\nThe short version: vibe coding is the second-best place to be in 2026. Autonomous development is the best. They are not the same thing, and the gap between them is the most consequential architectural decision a builder will make this year.\n\nThis post is the long version of that answer. It walks the five-stage maturity curve from autocomplete through autonomous, defines each rung honestly, and lays out a 90-day plan for climbing from vibe coding to autonomous development without the moves that usually go wrong.\n\n## The maturity curve\n\nThere are five stages of human-AI coding collaboration. Each one absorbs the prior one. Stage 5 contains every move from Stages 1 through 4, but the unit of work changes at every step.\n\n**Stage 1. Acceleration.** A model finishes your line. You're still the author. Output velocity: 1.1x. Your fundamental job hasn't changed.\n\n**Stage 2. Augmentation.** A model writes a function. You read it, edit it, commit it. The senior engineer still does most of the thinking. Output velocity: 2x. Copilot's original pitch.\n\n**Stage 3. Vibe coding.** A model writes most of the code. You become an editor in the loop, accepting or rejecting diffs in conversation with the agent. Output velocity: 5–10x for a session. Demos are great. Production code is hit or miss. **This is where most of the industry sits today.**\n\n**Stage 4. Agentic coding.** The agent runs tools, files, shell, browser, database, to accomplish a goal you stated. With the right setup, the agent can verify its own work. The human is still launching and supervising each session. Output velocity: 20–50x for a session. Each session needs a person to start and watch.\n\n**Stage 5. Autonomous development.** Multiple agents run unattended in parallel. Goal-directed. 
Self-verifying. The human role is decomposition (turning intent into tickets) and judgment (reviewing outcomes). Output velocity is no longer a useful metric; *organizational capacity* is.\n\nThe story most builders tell themselves is that Stages 3 and 5 are the same thing with more polish. They're not. Stage 5 contains an architectural commitment Stage 3 doesn't have, and getting from one to the other is most of the work.\n\n## What \"vibe coding\" actually means\n\nAndrej Karpathy coined the term in February 2025: *\"give in to the vibes, embrace exponentials, and forget that the code even exists.\"* The agent does the work; the human steers from the back seat.\n\nIn practice, vibe coding means three things:\n\n1. The human is in the loop on every change. You see a diff. You accept or reject it. You re-prompt when something's wrong.\n2. The agent doesn't verify its own work. Verification is the human's job: you click around, check the output, look for bugs.\n3. The session runs in real time, with the human watching. There's no batch mode. There's no overnight queue.\n\nVibe coding is *fast*. A single session can ship a feature that would have taken a week of by-hand engineering. It's also *flow-based*. The output of a vibe-coding session depends on the human's attention, taste, and ability to course-correct in real time.\n\nThe thing it isn't is *scalable*. The bottleneck on a vibe-coding setup is the human. You can't run twelve vibe-coding sessions in parallel because you can't watch twelve diffs at once. Your output velocity is bounded by your ability to review.\n\nThis is the ceiling Stage 3 hits. Most of the industry has hit it. The teams who think they're at the frontier of AI coding because they ship interactively in Cursor or Claude Code every day are at Stage 3. They're operating well. 
They are also one architectural step away from a different category of business.\n\n## What autonomous development means (the same definition, with the contrast sharpened)\n\n[Autonomous development](/blog/autonomous-development) is what happens when you remove the human from the inner loop.\n\nThe agent gets a goal. It executes against the goal. It verifies the result. It pushes the verified result to production. There is no human in any of those steps. The human's role is upstream of the work (decomposing goals into tickets) and downstream of it (judging outcomes), not inside the work itself.\n\nThis is not vibe coding with extra polish. It's a different architecture.\n\nIn autonomous development:\n\n1. The session runs without a human watching. You launch it and walk away. It might run for thirty minutes. It might run overnight.\n2. The agent verifies its own work. Browser MCP for UI. Test suite for code. Logs for runtime behavior. The agent doesn't trust itself; the agent *checks* itself.\n3. Multiple sessions run in parallel because no human is in the loop on any individual session. Eight workers, each on its own port, each in its own git worktree, each shipping independently.\n\nThe architectural commitment that separates Stage 5 from Stage 3 is **the verification loop**. Without it, you can't have autonomy. With it, you don't need a human in the loop. Everything else in autonomous development (the bash harness, the multi-stage pipeline, the doc auto-load, the sub-agent dilution) exists to support the verification loop or to clean up after it.\n\n> \"Vibe coding ends with the diff. Autonomous development ends with verified production code.\"\n> — Hunter Hodnett, Chipp CTPO\n\n## Why most teams stop at Stage 3\n\nTwo reasons. Both are honest.\n\n**Reason 1: Stage 3 is genuinely good.** Vibe coding ships features faster than the manual baseline. Builders feel productive. Customers get more software. Investors see velocity. 
The pain that would push a team to climb to Stage 5 doesn't exist as long as the team is happy with Stage 3 throughput.\n\n**Reason 2: Stage 5 is genuinely scary.** It requires deleting your PR review process, trusting the cluster to push to production, and rebuilding your incident response around an autonomous self-healing pipeline. The first two weeks of running autonomously are *deeply* uncomfortable for engineers used to controlling every commit.\n\nThe teams that stay at Stage 3 forever are the ones who decide the comfort is worth more than the velocity. That's a defensible position right up until a competitor crosses to Stage 5.\n\nWhen that competitor exists, Stage 3 becomes untenable. They're shipping at output velocities the human-in-the-loop architecture can't match. Their bug-fix latency is measured in minutes, not days. Their on-call rotation is empty. Their engineers spend their time on architecture and judgment instead of typing and reviewing.\n\nYou can stay at Stage 3 against a Stage 3 competitor forever. You can stay at Stage 3 against a Stage 5 competitor for about a year.\n\n## How to climb from Stage 3 to Stage 5 in 90 days\n\nMost teams who try to climb fail because they try to climb in one move. That doesn't work. The climb is six discrete steps, and skipping any of them produces a half-implementation that's worse than where you started.\n\nI'll lay them out in the order they should happen.\n\n### Days 1–14: Build a verification loop\n\nThis is the most important step and the one that gets skipped most often.\n\nPick one of your features. Build a browser MCP that knows how to spin up your dev server, navigate to the relevant page, take a screenshot, and read the console logs. This doesn't have to be your *production* dev server; a local Chromium instance and a small custom MCP wrapping it is enough.\n\nThen prove the loop end-to-end. Have the agent make a deliberately broken change. Run it through the verification loop. 
Watch it catch the break and fix it. If the loop works on a single deliberately-broken case, it'll work on the harder cases.\n\nIf the loop doesn't catch the break, you don't have a verification loop. Iterate until it does.\n\n### Days 15–30: Move from one context window to a multi-stage pipeline\n\nTake the workflow you've been doing in one Claude Code session (investigate, write code, review, push) and split it into stages. Each stage is its own session. Each session reads only the markdown file the prior stage wrote.\n\nTwo stages is enough to start: a *plan* stage that outputs a `plan.md` describing what to do, and an *execute* stage that reads `plan.md` and does it. Add review and docs stages later.\n\nThe discipline you're building here is *fresh-context handoff*. Once it's habitual, your sessions stop running out of context budget, your hallucinations drop, and your token spend per ticket goes *down* even though you're using more sessions. ([Why this works →](/blog/context-engineering))\n\n### Days 31–45: Build the bash harness\n\nYou can't run a session unattended without a manager. The bash harness is the manager. It enforces timeouts, kills sessions that hang, bans dangerous commands, forces commits, cleans up worktrees.\n\nStart with the skeleton in [the bug bot post](/blog/self-healing-bug-bot#component-2--the-bash-harness). Tune the timeouts to your workload. Add bans for any dangerous flag you've ever seen Claude attempt.\n\nThe harness is the thing that lets you walk away from the session. Without it, autonomous development is a research demo. With it, it's a production system.\n\n### Days 46–60: Wire up production triggers\n\nUntil now, you've been launching sessions manually. To get to Stage 5, sessions need to launch *themselves*, from production errors, customer reports, performance alerts.\n\nPick one trigger. We started with a Grafana webhook firing on production errors. Slack tag is the second-easiest. Email forward is the third. 
Anything that turns a real-world signal into a ticket in your queue is a trigger.\n\nOnce tickets land in your queue without you typing them, the cluster starts to feel autonomous. Because it is.\n\n### Days 61–75: Delete your PR queue\n\nThis is the step that separates the teams who actually reach Stage 5 from the teams who half-implement.\n\nWhen the cluster pushes a fix it has verified itself, the PR is the wrong layer. The verification has already happened. The PR is just a delay.\n\nDelete the PR. Push to staging. Let the deploy go. The cluster will catch its own breaks via the same trigger system that catches everyone else's.\n\nThis will feel wrong. Senior engineers will object. The objections are the right shape (*what if the cluster ships something bad?*), and the answer is *the cluster will fix what it ships, faster than any review queue would have caught it.* You have to either trust the loop or stay at Stage 3.\n\n### Days 76–90: Add the documentation auto-build\n\nThe last discipline. Every successful autonomous run should write or update markdown documentation in your `/docs/` folder. Future runs read those docs as context. The system gets smarter over time.\n\nThis is the part that compounds. After a quarter, your `/docs/` folder is the textbook of your codebase. After a year, it's a moat: your cluster works better on your codebase than any general-purpose autonomous system could, because it has the documentation no one else has.\n\nBy day 90, you're at Stage 5. Not perfectly. Not for every kind of work. But the architecture is in place, and from here it's incremental tuning.\n\n## What you give up\n\nHonesty matters. Climbing to Stage 5 costs you things.\n\n**You give up the ability to read every diff.** This is the hardest one for engineers attached to craft. You will, sometimes, see code in production you didn't write. Most of the time it'll be fine. Some of the time it'll be ugly. 
The pattern catches up over time as your `CLAUDE.md` accumulates style rules, but the first month is rough.\n\n**You give up the dopamine of fixing bugs yourself.** Bug fixing is satisfying. The autonomous cluster steals that satisfaction. You'll have to find your dopamine in architecture, judgment, and the kinds of work the cluster can't do.\n\n**You give up some headcount leverage.** You'll have a harder time hiring engineers who want to write code all day, because the cluster is doing most of that. You'll attract a different profile, engineers who want to design systems and lead agents.\n\n**You give up the comfort of the PR queue as a control mechanism.** The verification loop replaces it. You will, intermittently, miss the PR queue. The first time the cluster ships a bug, you will think *I should have caught that*. Then you'll watch the cluster fix the bug it shipped, and you'll get over it.\n\n## What you get back\n\n**You get back your nights and weekends.** This is not a metaphor. The cluster runs while you sleep. Production fires fix themselves. The on-call rotation goes empty.\n\n**You get back your engineering capacity for hard problems.** When the routine work is happening autonomously, you spend your day on the architectural decisions only a human can make. The work gets *more* interesting, not less.\n\n**You get back the ability to ship features that don't survive a cost-benefit analysis at a normal engineering org.** Redundancies. Polish. Anti-fragile fallbacks. Things that wouldn't justify a sprint become weekend tickets for an idle worker.\n\n**You get back the ability to compete with teams an order of magnitude larger.** This is the one that matters most strategically. Your two-person team becomes the productive equivalent of a fifteen-person team. Your fifteen-person team becomes the equivalent of a hundred. Your competitive position changes shape.\n\n## The simple version\n\nIf you're still vibe coding in 2026, you're operating well. 
You're shipping faster than the team next door who is still in Stage 2 review-everything mode.\n\nIf you're vibe coding in 2027, you'll be losing market share to competitors who climbed to Stage 5 in 2026.\n\nThe window in which Stage 3 is competitive is finite, and it's closing faster than most builders realize. Climb the curve while it's still cheap to climb.\n\n**[Join the Alchemist waitlist →](/#waitlist)**\n\n---\n\nIf you want the high-level case for the destination, read [The Autonomous Development Manifesto](/blog/autonomous-development).\n\nIf you want the implementation playbook for Stage 5, read [Building a Self-Healing Bug Bot](/blog/self-healing-bug-bot).\n\nIf you want the discipline that makes any of this work, start with [Context Engineering](/blog/context-engineering).",
      "date_published": "2026-05-06T00:00:00.000Z",
      "authors": [
        {
          "name": "Hunter Hodnett"
        }
      ],
      "tags": [
        "vibe-coding",
        "autonomous-development",
        "agentic-coding",
        "maturity-curve"
      ]
    },
    {
      "id": "https://adaas.dev/blog/building-your-first-mcp-server",
      "url": "https://adaas.dev/blog/building-your-first-mcp-server",
      "title": "MCP Is the USB-C of AI: Building Your First MCP Server in 30 Minutes",
      "summary": "The Model Context Protocol is what gives an AI agent senses, the ability to read your database, drive a browser, hit your production logs. This is the tutorial. Thirty minutes from a blank file to a working custom MCP server your Claude Code agent can call. Plus the gotchas that wasted a thousand dollars of our token spend.",
      "content_text": "Every USB cable I owned in 2018 was different. Mini, micro, lightning, USB-A, USB-B, the weird trapezoidal one for old printers. None of them worked with each other. Owning a cable for one device meant nothing for any other device.\n\nUSB-C ended that. One cable, one port, one protocol, and overnight, every device manufacturer's hardware became compatible with everyone else's. The standard didn't make any individual device better. It made the *ecosystem* better.\n\nThe Model Context Protocol (MCP) is doing the same thing for AI agents. Before MCP, every framework had its own format for tools. Each integration was bespoke; each integration was three integrations. After MCP, you write one server, and any MCP-compatible client (Claude Code, Cursor, Goose, increasingly anything) can use it.\n\nThis post is the tutorial I wish I'd had when I started. By the end of 30 minutes you'll have a working custom MCP server, registered to Claude Code, with one real tool that does something useful. Then we'll talk about what to build *after* the tutorial.\n\nIf you want the conceptual case for why MCP matters at all, [the manifesto covers it](/blog/autonomous-development#part-5-the-five-pillars). This post assumes you're convinced and want to build.\n\n## Part 1: What MCP actually is\n\nA language model only does one thing: output text. In every demo of an \"AI agent\" (the ones that browse the web, query a database, send Slack messages), the model itself isn't doing any of those things. It's emitting *text*, and software around the model is interpreting that text as an instruction.\n\nThe conventional way for the model to describe what it wants is a *tool call*. The model emits something like:\n\n```\nI want to call a tool named \"read_file\" with the argument {\"path\": \"src/index.ts\"}\n```\n\nThe framework around the model parses that, runs the actual `read_file` tool on your machine, gets the result, and feeds the result back into the model's next turn. 
The model now \"knows\" the contents of `src/index.ts`. It didn't read the file. The framework read the file. The model just got text saying \"here's what was in it.\"\n\nThis is the loop. Every interesting capability of every modern AI agent reduces to this loop.\n\n> \"The model never executes anything. It just describes what it wants executed, and software outside the model does the executing.\"\n> — Hunter Hodnett, Chipp CTPO\n\nMCP is the standardized way for the framework to find out *what tools exist*, *what parameters they take*, and *how to call them*. Before MCP, every framework had its own format. After MCP, you build a small server that conforms to the protocol, and any MCP-compatible client can use it.\n\nThe protocol itself is conceptually simple: an MCP server speaks JSON-RPC over either standard input/output (for local servers) or HTTP (for remote servers). It exposes a few methods, `tools/list` to enumerate what's available, `tools/call` to invoke one. Claude Code calls those methods on your behalf.\n\nYou don't need to memorize any of this. The TypeScript SDK handles the protocol details for you. You just declare your tools, write the implementations, and the SDK does the JSON-RPC plumbing.\n\n## Part 2: What we're building\n\nTo make this tutorial useful instead of abstract, we'll build a real MCP server: **a database query tool for a hypothetical app's `users` table**.\n\nIn 30 minutes, you'll have:\n\n- A local MCP server written in TypeScript.\n- A `query_users` tool the agent can call to look up users by email or ID.\n- A safe-column allowlist so the agent never accidentally exfiltrates passwords.\n- The server registered with Claude Code so you can use it interactively.\n\nThis is the shape of every internal MCP server we've built at Chipp. 
By the end you'll know how to write your own for your billing system, your job queue, your feature flag store, anything where a custom tool would beat a generic one.\n\n## Part 3: Setup (5 minutes)\n\nMake a fresh directory and initialize a Node project. We'll use TypeScript and the official MCP SDK.\n\n```bash\nmkdir my-first-mcp && cd my-first-mcp\nnpm init -y\nnpm install @modelcontextprotocol/sdk zod\nnpm install -D typescript @types/node tsx\n```\n\nCreate a minimal `tsconfig.json`:\n\n```json\n{\n  \"compilerOptions\": {\n    \"target\": \"ES2022\",\n    \"module\": \"ESNext\",\n    \"moduleResolution\": \"node\",\n    \"esModuleInterop\": true,\n    \"strict\": true,\n    \"outDir\": \"dist\"\n  },\n  \"include\": [\"src/**/*\"]\n}\n```\n\nAnd add `\"type\": \"module\"` to your `package.json` so we can use ESM imports.\n\nThat's the entire setup. You're done.\n\n## Part 4: The smallest working server (10 minutes)\n\nCreate `src/server.ts`:\n\n```typescript\nimport { Server } from \"@modelcontextprotocol/sdk/server/index.js\";\nimport { StdioServerTransport } from \"@modelcontextprotocol/sdk/server/stdio.js\";\nimport {\n  CallToolRequestSchema,\n  ListToolsRequestSchema,\n} from \"@modelcontextprotocol/sdk/types.js\";\nimport { z } from \"zod\";\n\n// 1. Define the input schema for the tool.\nconst QueryUsersArgs = z.object({\n  email: z.string().optional(),\n  id: z.string().optional(),\n});\n\n// 2. Define the tool itself.\nconst QUERY_USERS_TOOL = {\n  name: \"query_users\",\n  description:\n    \"Look up a user from the application's users table. \" +\n    \"Pass either `email` or `id`. Returns the user's id, email, name, \" +\n    \"and account_status. Sensitive columns (password_hash, \" +\n    \"oauth_tokens) are filtered at this layer and never exposed.\",\n  inputSchema: {\n    // `as const` keeps the literal type \"object\" so the tool list type-checks.\n    type: \"object\" as const,\n    properties: {\n      email: { type: \"string\", description: \"Email address to look up\" },\n      id: { type: \"string\", description: \"User ID to look up\" },\n    },\n  },\n};\n\n// 3. 
The fake \"database\" so this tutorial runs standalone.\nconst FAKE_DB = [\n  { id: \"u_1\", email: \"alice@example.com\", name: \"Alice\", account_status: \"active\" },\n  { id: \"u_2\", email: \"bob@example.com\", name: \"Bob\", account_status: \"suspended\" },\n];\n\n// 4. Boot the server, register handlers.\nconst server = new Server(\n  { name: \"my-first-mcp\", version: \"0.1.0\" },\n  { capabilities: { tools: {} } }\n);\n\nserver.setRequestHandler(ListToolsRequestSchema, async () => ({\n  tools: [QUERY_USERS_TOOL],\n}));\n\nserver.setRequestHandler(CallToolRequestSchema, async (req) => {\n  if (req.params.name !== \"query_users\") {\n    throw new Error(`Unknown tool: ${req.params.name}`);\n  }\n  const args = QueryUsersArgs.parse(req.params.arguments);\n  const user = FAKE_DB.find(\n    (u) => (args.email && u.email === args.email) || (args.id && u.id === args.id)\n  );\n  return {\n    content: [\n      {\n        type: \"text\",\n        text: user ? JSON.stringify(user, null, 2) : \"No user found.\",\n      },\n    ],\n  };\n});\n\n// 5. Wire stdio transport.\nconst transport = new StdioServerTransport();\nawait server.connect(transport);\n```\n\nThat's the full server. About 50 lines. Save it.\n\nYou can run it locally to make sure it doesn't crash:\n\n```bash\nnpx tsx src/server.ts\n```\n\nIt'll print nothing because it's waiting for JSON-RPC messages on stdin. Hit Ctrl-C.\n\n## Part 5: Register with Claude Code (3 minutes)\n\nTell Claude Code about the server by editing `~/.claude.json` (or your project's `.mcp.json` if you want it scoped to one project):\n\n```json\n{\n  \"mcpServers\": {\n    \"my-first-mcp\": {\n      \"command\": \"npx\",\n      \"args\": [\"tsx\", \"/absolute/path/to/my-first-mcp/src/server.ts\"]\n    }\n  }\n}\n```\n\nReplace `/absolute/path/to/` with the real path. 
Restart Claude Code.\n\nVerify the server is loaded:\n\n```\n$ claude\n> /mcp\n```\n\nYou should see `my-first-mcp` in the list, with one tool available.\n\n## Part 6: Try it out (5 minutes)\n\nIn a Claude Code session, ask:\n\n```\nLook up the user with email alice@example.com and tell me their account status.\n```\n\nThe agent should call `query_users` with `{\"email\": \"alice@example.com\"}`, get back the JSON for Alice, and report that her account is active.\n\nTry the negative case:\n\n```\nLook up the user with email nobody@example.com.\n```\n\nThe agent should call the tool, get back \"No user found.\", and report that.\n\nCongratulations, you have a working MCP server. The protocol plumbing is handled. The tool is yours.\n\n## Part 7: The discipline that makes this useful (7 minutes)\n\nNow the part that separates a tutorial MCP from a production MCP.\n\n### Tool descriptions are prompt engineering\n\nThe model decides whether to call a tool *based on the description*. Not the name. Not the parameters. The description.\n\nThe description we wrote above is honest but generic. A production version of the same tool would look more like this:\n\n```typescript\nconst QUERY_USERS_TOOL = {\n  name: \"query_users\",\n  description:\n    \"Look up a user from the application's users table. \" +\n    \"Pass either `email` (case-insensitive, exact match) or `id` (UUID). \" +\n    \"Returns: id, email, name, account_status. \" +\n    \"Sensitive columns (password_hash, oauth_tokens, payment_methods) \" +\n    \"are filtered at this layer and are NEVER exposed in any session. \" +\n    \"Use this when investigating customer-reported issues, debugging \" +\n    \"auth problems, or verifying user state. \" +\n    \"Do NOT use this to enumerate users (no list/scan capability) or \" +\n    \"to modify users (read-only). For mutations, use `update_user` instead.\",\n  ...\n};\n```\n\nThat description does five things the original didn't:\n\n1. 
Tells the model exactly what data shape to expect.\n2. Explicitly names the columns that are filtered out, so the model doesn't ask for them.\n3. Tells the model when to use the tool (and when not to).\n4. References sibling tools the model should use instead for related operations.\n5. Sets expectations about behavior (case-insensitive, exact match, etc.) so the model doesn't guess.\n\nThis level of detail is the difference between an MCP that the agent calls correctly and one that the agent calls in confusion. The investment compounds: every session that uses the tool benefits.\n\nA heuristic I use: **if the model would have to guess about anything, the description is too short**. Add a sentence.\n\n### Have Claude write your descriptions\n\nThe model is much better at writing descriptions for itself than you are. Once you've drafted the implementation, paste the schema into Claude Code and say *\"Write a tool description for this MCP tool. The agent will read this description to decide when to call the tool. Be specific. Mention edge cases. Tell the agent when not to use this tool.\"*\n\nThe first draft is usually 80% of the way there. Tighten and ship.\n\n### Filter sensitive data at the MCP layer\n\nThis is the big one. **Anything you don't want the agent to see should be filtered out at the server layer, not at the prompt layer.**\n\nIf you tell the agent \"don't look at password hashes\" via a system prompt, the agent might still look at password hashes when it's confused. If your MCP server *cannot return* password hashes, because the SQL has a hardcoded column allowlist that doesn't include them, the agent literally cannot see them. There's no failure mode where they accidentally get exposed.\n\nWe give our autonomous agents read access to production databases. The reason this is safe is that the database MCP server has hardcoded column allowlists per table. 
The agent can query `users`, but the only columns the MCP server will return are `id, email, name, account_status, created_at`. Everything else is filtered at the server. Even if the agent goes off the rails and asks for `password_hash`, the MCP server returns the allowlisted columns and ignores the rest.\n\nThis is how you ship MCP servers that touch production data without losing sleep.\n\n### Restart Claude Code when you change the server\n\nThe most painful gotcha. Claude Code starts the MCP server process *once*, at session start. If you change the server's source code mid-session, your edits won't take effect until you restart.\n\nYou will, at some point, edit your MCP server, run it, find a bug, fix the bug, and watch Claude Code call the *old version* of the tool. Then you'll spend an hour debugging \"why isn't my fix working\" before you remember the gotcha.\n\nWhen in doubt: restart Claude Code. There's a cheaper way (the `/mcp` command lets you reconnect), but the brute-force restart is the muscle memory worth building.\n\n### Don't `console.log` from your MCP server\n\nLocal MCP servers communicate with Claude Code over stdio in a structured JSON-RPC format. If you `console.log` anything, that text gets injected into the protocol stream, the framework chokes on the malformed bytes, and your tool stops working in confusing ways.\n\nUse a real logger that writes to *stderr*, not stdout. The MCP SDK has examples. The pain of debugging a stdio-corrupted MCP server is the kind of pain you feel only once before you internalize the rule.\n\n## Part 8: Beyond the Tutorial (What to Actually Build)\n\nThe fake-database MCP we just built is a tutorial, not a production system. Here's what to build *after* this tutorial, in order of leverage.\n\n### A custom database MCP for your real database\n\nReplace the `FAKE_DB` array with a real connection to your application's database. Hardcode the column allowlists per table. 
Add tools for the queries you actually need: `query_users`, `query_orders`, `query_subscriptions`, or whatever your domain requires.\n\nThis will be the most-used MCP in your cluster. We dispatch it on almost every Bug Bot ticket.\n\n### A custom browser MCP\n\nThe single highest-leverage MCP in autonomous development. ([Why →](/blog/mcp-is-not-optional))\n\nWrap a local Chromium instance via the [Chrome DevTools Protocol](https://chromedevtools.github.io/devtools-protocol/). Expose tools for navigate, screenshot, click, fill, and console-log retrieval. Bake in your dev login flow as its own tool: `browser_dev_login(role: \"free\" | \"enterprise\" | \"admin\")`, which bypasses your auth flow with seeded test credentials.\n\nThat last tool is the differentiator. Off-the-shelf browser MCPs are generic. The MCP we run for Chipp knows how to log in as any user role without going through OAuth. That domain knowledge is what makes verification fast enough to be useful in an autonomous loop.\n\n### A custom log-drain MCP\n\nWrap your log aggregator (we use Loki). Expose `query(labels, time_range)` and a higher-level `user_breadcrumbs(user_id, time_range)` that pulls a user's recent interactions before an error fired.\n\nThe `user_breadcrumbs` tool is what lets the agent reconstruct the user journey that led to a bug and propose fixes that match real usage, not synthetic edge cases.\n\n### A custom cron / job MCP\n\nIf you have any kind of background job system, wrap it. Expose tools for *list jobs*, *query job status*, and *trigger a job manually*. The agent will use these to debug job failures without you having to babysit.\n\n### Don't roll your own when an official one exists\n\nFor Stripe, GitHub, Cloudflare, Supabase, and Notion, use the official MCPs. The vendors maintain them. They keep up with API changes. 
They handle auth.\n\nWhere to roll your own: anything internal to your codebase, and anything where the off-the-shelf MCP is too generic to give the agent the right context (we found this true for databases: with off-the-shelf DB MCPs, the agent hallucinated column names constantly).\n\n## Part 9: The mental model to keep\n\nWithout MCP, you have a model that emits text. That's it. It can't read your code, can't look at your screen, can't query your database, can't verify anything it produces.\n\nWith MCP, the model can interact with the world. It can check its own work. It can correct its own mistakes. It can chain together capabilities that no single component of your system would have alone.\n\nThat's not an upgrade. That's a phase transition.\n\nEvery additional MCP server you build gives the agent more *senses*. The agent that ships your code in 2027 will have access to twenty MCP servers and will navigate them as fluently as you navigate your filesystem. The teams that build those servers earliest will have agents that ship the most reliable code, because their agents will have the most ways to verify themselves.\n\nBuild one this weekend. Then build the next one. The cluster works as well as your tools let it.\n\n**[Join the Alchemist waitlist →](/#waitlist)**\n\n---\n\nIf you want the conceptual case for MCP, the [manifesto](/blog/autonomous-development#part-5-the-five-pillars) covers why MCP is one of the five pillars of autonomous development.\n\nIf you want to see how MCP fits into a production cluster, read [Building a Self-Healing Bug Bot](/blog/self-healing-bug-bot). Component 4 walks through the four MCPs every Bug Bot setup needs.\n\nIf you want the foundational discipline for managing the context MCPs add, read [Context Engineering](/blog/context-engineering).",
      "date_published": "2026-05-05T00:00:00.000Z",
      "authors": [
        {
          "name": "Hunter Hodnett"
        }
      ],
      "tags": [
        "mcp",
        "mcp-server",
        "claude-code-mcp",
        "tutorial",
        "autonomous-development"
      ]
    },
    {
      "id": "https://adaas.dev/blog/skills-vs-sub-agents",
      "url": "https://adaas.dev/blog/skills-vs-sub-agents",
      "title": "Skills vs Sub-Agents: When to Use Each in Claude Code",
      "summary": "Skills and sub-agents are the two tools you reach for when CLAUDE.md and hub-and-spoke aren't enough. They look similar, both give Claude specialized capability, but they're architecturally different in a way that determines whether they save your context budget or burn it.",
      "content_text": "Once your `CLAUDE.md` is dialed in and you've sprinkled directory-scoped `CLAUDE.md` files through your codebase, you'll start hitting a different class of problem. There are kinds of knowledge that don't fit either pattern.\n\nSome knowledge is too big to put in `CLAUDE.md`, it would bloat every context window with information you only sometimes need. Some knowledge requires *work* to retrieve, not just to read. Some tasks are fundamentally side quests: you don't want them polluting your main agent's context, but you do want them done.\n\nClaude Code has two features for this: **skills** and **sub-agents**. They look superficially similar, both let you give Claude specialized capability, and most people I talk to use them interchangeably for a few weeks before they figure out the actual difference.\n\nThe actual difference is one sentence:\n\n**Skills are knowledge the agent reads while it's working. Sub-agents are work the agent delegates to a separate instance.**\n\nOnce you internalize that, the rest writes itself. But it took me a quarter to figure out, so let me save you the time.\n\n## What a skill is\n\nA skill is a markdown file with a name, a description, and a body. The body is whatever you want, usually instructions, examples, or domain-specific rules. The description is what Claude reads to decide whether to *invoke* the skill at all.\n\nWhen the description matches what the user asked for, the skill loads. The body of the skill becomes part of the context window for the rest of that session, and the agent has it as reference while it works.\n\nThe mental model is **a cheat sheet on your desk**. You're working at a desk. There's a sticky note on the desk. 
While you work, you glance at the sticky note for the formulas you can't remember.\n\nI have a skill called `chipp-design` that contains every visual convention of the Chipp brand: our color tokens, our spacing scale, our component library, when to use which animation, how to do glass-morphism in a way that's still legible. The description tells Claude: *\"Build UI components and pages following the Chipp brand design system for Svelte 5. Use this skill when creating Svelte components, pages, or UI elements.\"*\n\nWhen I ask the agent to build a settings page, the description matches, the skill loads into context, and the agent now has the entire design system as reference. Without the skill, the agent would invent tokens, hardcode hex colors, and ship something that doesn't match the rest of the platform. With it, the output looks like our team built it.\n\nThat's a skill working at its best. The cost: tokens. The skill body, which can be substantial, is now occupying space in my context window. Every other tool call has less room.\n\n## What a sub-agent is\n\nA sub-agent is a separate Claude Code session that the main agent spawns, gives a starting prompt, and waits for a result. The sub-agent has its own context window. It runs its own tool calls. When it's done, it sends a summary back to the calling agent, and only the summary lands in the main context window.\n\nThe mental model is **sending an intern to the library**. You're sitting at your desk. You realize you need to know something obscure, say, what's the current state of all our Kubernetes pods. Rather than walking to the library yourself, you send an intern. They go off, do the work, come back, and hand you a one-page summary. You absorb the summary in seconds. The intern absorbed every page of the encyclopedia.\n\nI have a sub-agent called `infra-ops`. 
Its system prompt knows everything about our Kubernetes cluster: which `kubectl` commands are safe, where the production logs live, how to read deployment YAML, what's normal versus alarming. When the main agent runs into something like *\"pods are restarting in production,\"* it doesn't try to investigate on its own; it spawns the `infra-ops` sub-agent.\n\nThe sub-agent fills its own 1M-token context window with raw `kubectl` output, log excerpts, and deployment manifests. It correlates them, finds the actual issue, and reports back to the main agent: *\"Pods are OOM-killing because the last deploy lowered the memory limit too aggressively. Recommend bumping `limits.memory` from 512Mi to 1Gi.\"*\n\nThat two-sentence summary lands in my main context window. The 950k tokens of garbage that were needed to derive it stay in the sub-agent's window, where they can't pollute anything.\n\n## The token math\n\nBoth tools cost tokens. They don't cost them in the same way.\n\n**A skill** spends from your *current* context budget. The skill body is loaded into the active session's context window. If your skill is 8,000 tokens long, you have 8,000 fewer tokens for everything else this session. If you load three skills that size, that's 24,000 tokens gone.\n\n**A sub-agent** spends from a *separate* context budget. The sub-agent has its own window, its own tool calls, its own model invocations. From the calling agent's perspective, it spent the cost of one tool call: *\"spawn sub-agent, here's the prompt, get back a summary.\"* The sub-agent might have spent 150,000 tokens internally to produce that summary, but the calling agent doesn't see them.\n\nThis matters more than you'd think. On a complex task, the main agent might spawn five sub-agents over the course of its run. Each sub-agent fills 100,000–200,000 tokens of its own context window doing real work. The main agent, meanwhile, accumulates the five summaries, maybe 5,000 tokens total. The main agent stays nimble. It doesn't compact. 
It doesn't lose track of why it started the task.\n\nIf you tried to do the same work with five skills loaded into the main agent, you'd have to load all the relevant context for all five domains into the same window. The main agent would either compact halfway through or run out of budget completely.\n\nThis is the central trick. **Skills concentrate context in the main agent. Sub-agents distribute context across separate agents.**\n\n> \"If you'd describe the task as 'go figure out X and tell me,' it's a sub-agent. If you'd describe it as 'while you work, remember X,' it's a skill.\"\n> — Hunter Hodnett, Chipp CTPO\n\n## When to reach for which\n\nThe decision rubric is simpler than I expected once I had it.\n\n**Use a skill when:**\n\n- You need the knowledge *while the agent is actively coding*, referencing it dozens of times during the work.\n- The knowledge is short enough that loading it doesn't blow the budget.\n- The work is in the agent's main domain (writing the feature, not debugging infra).\n\nExamples: design systems, code style guides, API conventions, common pitfalls for the current subsystem.\n\n**Use a sub-agent when:**\n\n- The task is *fact-finding* or *side investigation*, read a bunch of stuff, return one insight.\n- The task involves a lot of tool calls that the main agent doesn't need to see.\n- The task is in a separate domain (debugging Kubernetes from a feature-development session).\n- The task can fail without affecting the main work.\n\nExamples: debugging production issues, researching how a third-party library works, auditing the codebase for instances of a deprecated pattern, summarizing a long document.\n\nA test that almost always works: **if you would describe the task as \"go figure out X and tell me,\" it's a sub-agent. 
If you'd describe it as \"while you work, remember X,\" it's a skill.**\n\n## Real examples from Bug Bot\n\nThree live examples from our autonomous cluster.\n\n### `chipp-design` (skill)\n\nLoads when the agent is doing UI work. About 6,000 tokens of design rules, component library reference, and scar-tissue notes about CSS gotchas. Description: *\"Build UI components and pages following the Chipp brand design system for Svelte 5.\"*\n\nThe agent reads it dozens of times during a UI ticket, checking spacing tokens, color palette, component conventions. Loading it as a skill means the reference is *there* the whole time, not behind a tool call.\n\nIf we tried to handle this with a sub-agent, *\"go figure out our design system and tell me what to use\"*, the agent would have to dispatch the sub-agent, wait for the summary, and then realize it needed more detail and dispatch again. Skill is the right tool.\n\n### `infra-ops` (sub-agent)\n\nLoads when the main agent has a production issue and needs investigation. The sub-agent has its own runbook (kubectl commands, log query patterns, common failure modes), its own tools (kubectl MCP, Loki MCP), and its own context window.\n\nThe main agent dispatches it with one tool call: *\"Investigate why pods restarted in the last hour.\"* The sub-agent runs 47 tool calls, fills 200k of context window, correlates everything, and returns a one-paragraph diagnosis.\n\nIf we tried to handle this with a skill, *\"here's everything about our K8s cluster, now investigate\"*, the main agent's context would fill up with raw kubectl output and lose its grip on the actual ticket. Sub-agent is the right tool.\n\n### `feature-deep-dive` (sub-agent)\n\nLoads when the main agent needs to understand how an existing feature works before modifying it. 
The sub-agent reads the feature's code, related tests, recent git history, and any relevant docs, then returns an architecture summary.\n\nThe main agent gets the summary in its context, applies it to the modification work, ships the change. The 100k of code-reading the sub-agent did doesn't pollute the main session.\n\nThis is one of our most-dispatched sub-agents. Almost every non-trivial feature ticket dispatches it as the first step.\n\n## Anti-patterns I've shipped and regretted\n\nThree patterns I've burned tokens learning are wrong.\n\n### The omnibus skill\n\nI had one called `everything-about-our-platform` that loaded a 30,000-token document with our entire architecture. It was easier than thinking about which skill to write. The agent would load it for *every* task, including ones that didn't need it, and we'd lose 30,000 tokens of budget every session.\n\nSplitting it into focused skills (`chipp-billing`, `chipp-design`, `chipp-routing`, `chipp-auth`) meant only the relevant 5,000 tokens loaded at a time. Big improvement.\n\nIf your skill description starts with *\"general knowledge about…\"*, you're building an omnibus. Split it.\n\n### The sub-agent for trivial work\n\nSpawning a sub-agent has overhead, a separate model invocation, the round-trip of starting a new session, the cost of sending the initial prompt. If the task is small, just do it inline.\n\nSub-agents pay off when the task would fill 50,000+ tokens of context. They cost more than they save when the task would fill 500.\n\nHeuristic: if you'd be embarrassed to interrupt a colleague to ask the question, don't dispatch a sub-agent for it.\n\n### Skills used as sub-agents\n\nThis is the most common mistake I see. Someone wants the agent to *\"go check the database for orphaned records.\"* They write a skill called `database-investigation` with instructions and example queries. 
The skill loads, then the *main agent* runs the queries, and now its context window is filling up with raw database rows.\n\nThey wanted the work done elsewhere. They gave themselves a cheat sheet instead.\n\nThe fix is a sub-agent: spawn one, let it run the queries, have it return *\"found 47 orphans, here are the IDs.\"* Main context window stays clean.\n\nThe pattern to internalize: skills are reference material the main agent uses to do work itself. Sub-agents are work delegated to a separate agent, with only the result coming back.\n\n## The simplest decision\n\nIf you want the *agent* to know something, write a skill.\n\nIf you want *someone else* to know something, spawn a sub-agent.\n\nThat's it. That's the whole post.\n\nThe deeper the autonomous cluster gets, and the [Bug Bot pipeline](/blog/self-healing-bug-bot) leans on both heavily, the more these two patterns become the basic structural elements of every workflow. We use both, constantly. The skills carry the patterns we want consistent across all our work. The sub-agents carry the heavy lifting that would otherwise crush our main agent's context.\n\nGet this distinction right and your context budget stops being the bottleneck on everything you ship.\n\n**[Join the Alchemist waitlist →](/#waitlist)**\n\n---\n\nIf you want the foundational discipline these patterns build on, read [Context Engineering: The Skill That Turns Claude Into a Production Co-Developer](/blog/context-engineering).\n\nIf you want the architecture for managing skills and sub-agents at scale, read [CLAUDE.md Architecture](/blog/claude-md-architecture).\n\nIf you want to see skills and sub-agents at work in a production cluster, read [Building a Self-Healing Bug Bot](/blog/self-healing-bug-bot).",
      "date_published": "2026-05-04T00:00:00.000Z",
      "authors": [
        {
          "name": "Hunter Hodnett"
        }
      ],
      "tags": [
        "claude-code-subagents",
        "claude-code-agent-teams",
        "skills",
        "sub-agents",
        "claude-code"
      ]
    },
    {
      "id": "https://adaas.dev/blog/agentic-design-patterns",
      "url": "https://adaas.dev/blog/agentic-design-patterns",
      "title": "Agentic Design Patterns for Production: 7 Patterns We Battle-Tested at Chipp",
      "summary": "Design Patterns named the moves OOP engineers were already using ad-hoc, and naming them made them transferable. Agentic systems need the same treatment now. Seven patterns we've battle-tested across two years of running autonomous development at Chipp, what each one solves, when to use it, and how to know you've gotten it wrong.",
      "content_text": "The Gang of Four published *Design Patterns* in 1994. They didn't invent any of the patterns in the book. They named twenty-three patterns that OOP developers were already using ad-hoc, and the act of naming them made the practice transferable. Engineers who'd never thought about Strategy or Decorator could read the book, recognize the moves they were already half-doing, and start using them on purpose.\n\nAgentic systems are at the same place now. The shape of how to build production-grade autonomous workflows is becoming clear; the patterns are emerging across teams that have shipped real systems. They just don't have names yet.\n\nThis post names seven of them. Each one is a pattern we've battle-tested across two years of running our [autonomous development cluster](/blog/self-healing-bug-bot) at Chipp. Each one solves a specific failure mode. Each one is portable to whatever stack you're building on.\n\nYou don't need all seven to start. Pick three. Implement them this month. The other four will become obvious once the first three are working.\n\n## Pattern 1: The Multi-Stage Pipeline\n\n**Problem**: A single Claude Code session that tries to do everything (research, implement, review, document, push) runs out of context budget. The session compacts. The agent loses the thread. The output quality collapses.\n\n**Pattern**: Split the work into independent stages. Each stage gets its own Claude Code session with a fresh context window. Stages communicate by writing markdown files to disk; the next stage reads only the file the prior one wrote.\n\n**Our pipeline at Chipp**:\n\n```\n[Trigger] → [Phase 0: Doc retrieval] → [Phase 1: Research]\n         → [Phase 2: Implement]   → [Phase 3: Code review]\n         → [Phase 4: Docs update] → [Phase 5: Push to prod]\n```\n\nEach phase is its own Claude session. Phase 1's output is `plan.md`. Phase 2 reads `plan.md` (no other context) and writes the code. 
Phase 3 reads the diff (no other context) and reviews it. And so on.\n\n**Why this works**: A single 1M-token context window can hold a lot, but it can't hold *everything* you want it to hold across an entire feature ticket. By the time the agent has read 30 files, queried logs, formed a hypothesis, written code, run tests, and reviewed its own diff, the window is full and the early reasoning has been compacted to a useless paragraph.\n\nSplitting into stages gives each stage a fresh window. The stage that's writing code doesn't need to remember every file the research stage looked at, it just needs the plan. The plan is the *distilled* output of the research, and distillation survives where raw evidence wouldn't.\n\n**Implementation**: Stage outputs are markdown files in a known location. The bash harness orchestrates the handoff. Don't try to do this with one long-running session and \"memory.\" It will fail.\n\n**Anti-pattern**: The temptation is to add more stages. Five is the right number for most pipelines. Don't go to ten. Each additional stage costs latency and a chance for handoff failure. If a stage isn't earning its place, merge it into a neighbor.\n\n## Pattern 2: Sub-Agent Dilution\n\n**Problem**: Some investigations require *huge* amounts of context, reading thousands of lines of logs, running dozens of tool calls, correlating evidence across many sources. If you do this in your main session, you've burned the budget on context the main task doesn't need.\n\n**Pattern**: Spawn a sub-agent. The sub-agent has its own context window. It does the heavy investigation. It returns a one-paragraph insight to the calling agent. The 950k tokens of evidence stay in the sub-agent's window, where they belong.\n\n**Our `infra-ops` sub-agent**: When the main agent encounters something like *\"pods are restarting in production,\"* it doesn't try to investigate itself. It dispatches the `infra-ops` sub-agent. 
That sub-agent runs 47 `kubectl` commands, queries Loki, cross-references the deploy history, and returns: *\"OOM-killing because the last deploy lowered the memory limit too aggressively. Recommend bumping `limits.memory` from 512Mi to 1Gi.\"*\n\nThat two-sentence summary is what lands in the main session. ([Full pattern →](/blog/skills-vs-sub-agents))\n\n**Why this works**: The mental model is *sending an intern to the library*. You don't want every page they read; you want the answer. Sub-agents give you the architectural shape to do exactly that.\n\n**Implementation**: Define sub-agents in `.claude/agents/`. Each one is a markdown file with its own system prompt and tool list. Reference them in your root `CLAUDE.md` so the main agent knows when to dispatch which.\n\n**Anti-pattern**: Don't dispatch a sub-agent for a task the main agent could finish in two tool calls. Sub-agents have overhead: a separate model invocation, the prompt round-trip, the deserialization of the result. They pay off when the task would otherwise fill 50,000+ tokens of context. They cost more than they save when the task is small.\n\n## Pattern 3: The Browser Verification Loop\n\n**Problem**: The agent writes code that compiles, passes tests, and looks correct in the diff. None of that proves the code *works in a browser*. Buttons can render in the wrong color. Click handlers can throw runtime exceptions. API calls can fail. The agent doesn't know.\n\n**Pattern**: After every code change, the agent spins up a dev server, opens a browser via the [browser MCP](/blog/building-your-first-mcp-server), navigates to the affected page, takes a screenshot, reads the console logs, and verifies the change worked. If anything's wrong, the agent forms a new hypothesis and iterates.\n\n**The actual loop**:\n\n1. Code changes saved in worktree.\n2. Dev server (already running on dedicated port) auto-reloads.\n3. Agent calls `browser_navigate('localhost:5184/affected-page')`.\n4. 
Agent calls `browser_screenshot()`. Reads the image (multimodal models *see* the screenshot).\n5. Agent calls `browser_console_logs()`. Reads the console output.\n6. If no errors, agent calls `browser_click('#confirm')` or whatever interaction tests the change.\n7. Repeat screenshot + logs read.\n8. If errors, the agent forms a hypothesis, edits the code, loop restarts.\n\n**Why this works**: Most \"AI ships bad code\" stories are stories about agents that wrote plausible-looking code, never tested it, and pushed. The browser loop is the difference between \"the agent thinks the code works\" and \"the agent has *checked* that the code works.\" Closing this loop is the single architectural change that turned our cluster from interesting demo to production system.\n\n**Implementation**: Custom browser MCP wrapping a headless Chromium via the [Chrome DevTools Protocol](https://chromedevtools.github.io/devtools-protocol/). Off-the-shelf browser MCPs work for prototyping. For production, build your own, bake in your dev login flow, your seed data, your test scenarios. The custom version is the difference between fast and slow autonomous verification.\n\n**Anti-pattern**: Don't run the verification loop on a shared dev server. Each agent worker needs its own port (we use 5180–5187 for our 8-worker pool) so parallel agents don't fight for the same port. Each agent also needs its own git worktree so they don't step on each other's changes mid-loop.\n\n## Pattern 4: CLAUDE.md as Scar Tissue\n\n**Problem**: The agent makes the same mistake on every session. You correct it interactively. The next session, the correction is gone, the agent doesn't remember what it learned in a different session. You're paying for the same lesson over and over.\n\n**Pattern**: Treat your `CLAUDE.md` as a scar tissue document. Every time the agent makes the same class of mistake three times, stop the session, write a rule into `CLAUDE.md` that prevents it, and continue. 
Over months, your `CLAUDE.md` accumulates the real rules of your codebase, the ones you can only learn by getting bitten.\n\n**Why three strikes**: Once is an outlier. Twice is suspicious. Three times is a pattern. Patterns are what `CLAUDE.md` is for. Adding a rule per mistake bloats the file with one-off lessons that dilute the load-bearing rules.\n\n**Why this works**: `CLAUDE.md` loads in *every* session and survives compaction. It's the only place to put context that you want the agent to have *forever* without paying for re-discovery on every run. Every line in `CLAUDE.md` pays compounding dividends.\n\n**Implementation**: Have the *agent* write the rules into `CLAUDE.md` for you. The model knows what kind of rule will register on its own future inference better than you do. When the agent makes a mistake, prompt it: *\"Add a rule to `CLAUDE.md` that prevents this exact mistake. Cite the failure mode. Make it specific enough to act on.\"* Then read what it wrote and tighten it.\n\n**Anti-pattern**: Aspirational `CLAUDE.md`s. Rules like *\"always write clean code\"* and *\"prefer composition over inheritance\"* are too vague to act on. The agent ignores them. Replace with specific, scar-tissue-grounded rules tied to real failure modes you've seen. ([Full discipline →](/blog/claude-md-architecture))\n\n## Pattern 5: The Auto-Load Table\n\n**Problem**: Some context is too domain-specific to live in your root `CLAUDE.md` (it would bloat every session) but too cross-cutting to live in a single subdirectory `CLAUDE.md` (the rules apply across the codebase whenever a *topic* is mentioned, not whenever a *directory* is touched).\n\n**Pattern**: At the top of your root `CLAUDE.md`, put a small markdown table mapping keywords to documentation files. 
When a prompt mentions any of the keywords, the agent reads the corresponding doc into context before starting work.\n\n**Our table at Chipp**:\n\n```markdown\n## Auto-load table\n\n| Mention | Read |\n|---|---|\n| billing, stripe, payment, subscription | docs/billing.md |\n| auth, login, session, oauth | docs/auth.md |\n| websocket, realtime, streaming | docs/realtime.md |\n| voice, livekit, transfer | docs/voice-agents.md |\n| migration, schema, kysely | docs/db-migrations.md |\n```\n\nThe keywords are inclusive, if a ticket mentions \"stripe\" *or* \"subscription\" *or* \"billing,\" the agent loads `docs/billing.md`. The rules in that doc are far too specific to put in the root `CLAUDE.md` (Stripe API quirks, our shadow-billing system, the eight failure modes of webhook delivery), but they're load-bearing whenever the work touches billing.\n\n**Why this works**: It scales the system's domain knowledge horizontally. You can add ten more rows to the auto-load table without inflating the per-session context cost, the docs only load when relevant. Every successful autonomous run that produces a useful insight about a subsystem can write a new doc and get a new row, and tomorrow's tickets get smarter.\n\n**Implementation**: Put the table at the top of your root `CLAUDE.md`. Be conservative with keywords, false positives waste budget. Generate the docs lazily as you ship, don't try to write all the docs upfront. Use the doc-update phase of your pipeline (Pattern 1) to keep the docs current.\n\n**Anti-pattern**: Don't load *every* doc on every session \"just in case.\" That's the bloat that this pattern is designed to prevent. Trust the keyword match. If the agent fails to load a doc it should have loaded, the keyword list was wrong, fix the list, don't load everything.\n\n## Pattern 6: The Bash Harness Wrapper\n\n**Problem**: Claude Code is non-deterministic. It can hang. It can run for hours. It can attempt commands you don't want it running. 
It can finish work and forget to push. Your business needs to be deterministic. Your tokens are limited. Something has to be the adult in the room.\n\n**Pattern**: Wrap every Claude Code invocation in a bash script that supervises the session. The script enforces timeouts, kills hangs, bans dangerous commands, forces the final commit and push, cleans up worktrees, and writes outcome labels.\n\n**What our harness enforces**:\n\n- **Idle kill**: if no tool call fires for 5 minutes, kill the session. Catches hangs.\n- **Wall-clock timeout**: `timeout 7200` (2 hours) caps runaways.\n- **Banned-flag grep**: if the session attempts `git push --no-verify`, `git reset --hard`, or `rm -rf`, it is aborted on detection.\n- **Forced commit + push**: at the end of every session, check the worktree state and force the push if Claude forgot.\n- **Worktree cleanup**: each run is isolated; nothing leaks between workers.\n- **Outcome logging**: every run writes a JSONL row to a fine-tuning archive. (Pattern 7.)\n\n**Why this works**: The model is good at writing code. It's bad at managing its own time, recognizing when it's stuck, or cleaning up after itself. The harness handles those things deterministically so the model can focus on the work.\n\n> \"An autonomous agent without a bash harness is an intern with no manager, no deadline, and an unlimited API budget.\"\n> — Hunter Hodnett, Chipp CTPO\n\n**Implementation**: Bash, not Node or Python. Bash is the right language for wrapping Unix processes: subprocess management, signals, timeouts, and pipes are all concise in bash and verbose in everything else. Bash is also debuggable in production (no compilation step) and Claude has more bash training data than any other shell language, so the agent can edit the harness too.\n\nThe skeleton is in [the bug bot post](/blog/self-healing-bug-bot#component-2--the-bash-harness). About 200 lines covers everything above.\n\n**Anti-pattern**: Don't try to build the harness in your application stack. 
Don't make it a feature of your CI system. The harness is a deliberately tiny, deliberately separate piece of infrastructure. Keep it that way.\n\n## Pattern 7: The Outcome-Labeled JSONL Archive\n\n**Problem**: Every autonomous session generates a record of how a frontier model approached a real task in your codebase. That's training data. If you don't capture it, it's gone forever. If you do capture it, you have the basis for fine-tuning a cheaper, specialized model on your own work, the kind of moat that compounds.\n\n**Pattern**: After every autonomous run, append a JSONL row to a long-term archive describing the run, the stages, the token spend, the tool calls, the diff, and the outcome label.\n\n**Our archive row**:\n\n```json\n{\n  \"ticket_id\": \"billing-create-customer-null-pmt\",\n  \"trigger_source\": \"grafana\",\n  \"started\": \"2026-04-15T03:31:18Z\",\n  \"finished\": \"2026-04-15T03:47:02Z\",\n  \"stages\": {\n    \"research\": { \"tokens\": 412053, \"tool_calls\": 38 },\n    \"implement\": { \"tokens\": 187234, \"tool_calls\": 23 },\n    \"review\": { \"tokens\": 91482, \"tool_calls\": 12, \"edits\": 1 },\n    \"docs\": { \"tokens\": 43210, \"tool_calls\": 4 },\n    \"push\": { \"tokens\": 0, \"tool_calls\": 0 }\n  },\n  \"outcome\": \"clean\",\n  \"regressions_detected_24h\": false\n}\n```\n\nThe `outcome` field is the label. `clean` means: review made ≤5 edits, all tests passed first try, no regressions detected within 24 hours of deploy. `messy` means anything else.\n\n**Why this works**: Labeled data is the asset that produces fine-tuned models. Every successful autonomous run produces a labeled training row showing how a frontier model approached a real engineering task. After a quarter, you have thousands of rows. After a year, you have a dataset no other team can replicate, because it's specific to *your* codebase and *your* practice.\n\nYou may never train a model on this data. That's fine. 
The decision to *capture* it is one you make today. The decision to *use* it is one you can defer for years. But you can't decide to use data you didn't capture.\n\n**Implementation**: One JSONL row per session. Append-only file in cheap storage (S3, R2, even disk for now). Label the outcome with whatever automated heuristics you have: review edit count, test pass rate, post-deploy regression detection. Don't try to label perfectly; the labels can be improved later.\n\n**Anti-pattern**: Don't try to make this data structured beyond JSONL. JSONL is append-only, easy to grep, easy to load into training pipelines. SQL is overkill. NoSQL is a different kind of overkill. Just the file.\n\n## How to use these patterns\n\nYou don't need to implement all seven on day one. Most teams who try fail at exactly that: they read this post, they get excited, they try to build a 7-pattern cluster in two weeks, they fail at three of them, and they conclude none of it works.\n\nThe order I'd implement them in:\n\n1. **Browser verification loop** (Pattern 3). Without this, you can't have autonomy. Build it first. Even if you do nothing else from this list, build this.\n2. **Multi-stage pipeline** (Pattern 1). The next biggest leverage. Splits your sessions, controls your context budgets, makes everything else possible.\n3. **Bash harness** (Pattern 6). Once you have a pipeline, you need the supervisor. This is the difference between a hobby project and something you can leave running overnight.\n4. **CLAUDE.md as scar tissue** (Pattern 4). Discipline, not infrastructure. Start practicing it on day one of using any of the others. The compounding starts immediately.\n5. **Auto-load table** (Pattern 5). After you've shipped enough autonomous tickets to start writing real `/docs/` files. Premature otherwise.\n6. **Sub-agent dilution** (Pattern 2). Once your main pipeline is hitting context-budget walls on the heavy investigations. Solves a real problem; not a problem you have on day one.\n7. 
**Outcome-labeled archive** (Pattern 7). Nothing-to-lose pattern. Start it as soon as you have a pipeline. Even if you never use the data, you'll be glad you have it in a year when distillation becomes the obvious move.\n\nPick three. Build them this month. The next three become obvious once the first three are working.\n\n## What you actually get\n\nA team running these seven patterns ships in a different category than a team running interactive Claude Code sessions, and a *radically* different category than a team still doing all-human engineering.\n\nThe numbers from our cluster, honestly:\n\n- 20–30 production deploys per day on a two-person engineering team.\n- 70–80% first-attempt success rate on autonomous tickets.\n- Mean time from production error to fix in production: ~30 minutes, autonomously.\n- Token cost per ticket: low double-digit dollars on a frontier model.\n- Pull requests: zero.\n- Pages we receive overnight: zero.\n\nThe patterns aren't magic. Each one solves a specific failure mode. Together they compose into a system where the failure modes don't compound, when one pattern hits its edge case, the others catch it.\n\nThe Gang of Four made OOP transferable by naming the moves. The seven patterns above make autonomous development transferable. Use them, name them, build on them. We'll be writing about the next batch as they emerge.\n\n**[Join the Alchemist waitlist →](/#waitlist)**\n\n---\n\nIf you want the high-level case for autonomous development, read [The Autonomous Development Manifesto](/blog/autonomous-development).\n\nIf you want the implementation walkthrough of all seven patterns wired into one cluster, read [Building a Self-Healing Bug Bot](/blog/self-healing-bug-bot).\n\nIf you want the discipline that underpins every pattern in this post, read [Context Engineering](/blog/context-engineering).",
      "date_published": "2026-05-03T00:00:00.000Z",
      "authors": [
        {
          "name": "Hunter Hodnett"
        }
      ],
      "tags": [
        "agentic-design-patterns",
        "agentic-workflows",
        "autonomous-development",
        "claude-code",
        "architecture"
      ]
    },
    {
      "id": "https://adaas.dev/blog/claude-md-architecture",
      "url": "https://adaas.dev/blog/claude-md-architecture",
      "title": "CLAUDE.md Architecture: A Hub-and-Spoke Pattern for Autonomous Codebases",
      "summary": "Your CLAUDE.md is the highest-leverage file in your codebase. It survives compaction, loads in every session, and accumulates the rules that turn a generic Claude into your codebase's senior engineer. Here's the full architecture: root file, directory hub-and-spoke, scar-tissue practice, auto-load tables, and the anti-patterns that wreck it.",
      "content_text": "`CLAUDE.md` is the most important file in any Claude Code project.\n\nI mean that literally. There's no other file that touches every session, survives every compaction, and accumulates value over months of use. Your `package.json` doesn't. Your README doesn't. Your tests don't. `CLAUDE.md` does.\n\nAnd yet most teams I've worked with treat it as a write-once afterthought. They generated it with `claude /init` six months ago, never opened it again, and now wonder why their agents keep making the same five mistakes.\n\nThis post is the full architecture of `CLAUDE.md` as we run it at Chipp. Root file, hub-and-spoke directory loading, scar-tissue practice, auto-load tables, sub-agent definitions, and the specific anti-patterns that turn a useful `CLAUDE.md` into a useless one.\n\nIf you only have time to internalize one thing from the [autonomous development series](/blog/autonomous-development), make it this one.\n\n## What `CLAUDE.md` actually is\n\n`CLAUDE.md` is a markdown file in your project root. Claude Code finds it automatically and prepends its contents to the system prompt of every session in that project.\n\nThat's the whole mechanism. There's no magic. There's no plugin. The file is just *there*, and Claude reads it.\n\nThree properties make it special.\n\n**Property 1: It loads in every session.** Whether you start an interactive session, run `claude -p` headless, spawn a sub-agent, or dispatch a session from a webhook, `CLAUDE.md` is part of the system prompt every time. There is no opt-out.\n\n**Property 2: It survives compaction.** When the context window fills and the model summarizes everything else, `CLAUDE.md` is one of the few things that stays intact. The post-compaction agent has lost the file contents you read, the grep results you ran, the reasoning that led to the current state, but it still has `CLAUDE.md`. 
([Why compaction matters →](/blog/context-engineering#the-compaction-trap))\n\n**Property 3: It's the only training data you fully own.** The model is trained on the public internet. The MCP servers you use are someone else's tools. The Claude Code CLI is Anthropic's product. `CLAUDE.md` is *yours*. It's the one piece of context the agent loads that nobody else can see, replicate, or take away.\n\nThese three properties make `CLAUDE.md` the *highest-leverage file in your codebase*. Every line in it pays compounding dividends. Every line you neglect to add costs you the same mistake, repeated, every time the agent runs.\n\n## The scar-tissue practice\n\nThe fundamental discipline of `CLAUDE.md` is treating it as a *scar tissue document*, not an aspiration document.\n\nAn aspiration document says things you wish were true. *\"Always write clean code.\" \"Prefer composition over inheritance.\" \"Use semantic HTML.\"* The agent ignores most of it because the rules are too vague to act on and too unmotivated to take seriously.\n\nA scar tissue document says things that have actually bitten you. *\"Don't use lodash; we removed it in v4 and the bundle size went up 30KB last time someone re-added it.\" \"Database migrations have to be backward-compatible because we deploy without taking the API down.\" \"The `useUser` hook returns null during the auth bootstrap; check for it.\"*\n\nEach of those sentences is a real bug we hit. Each one is a rule we wrote into `CLAUDE.md` after the third or fourth occurrence. Each one prevents the agent from making the same mistake again. The aspiration version of those rules would have been ignored. The scar tissue version is treated as load-bearing.\n\nThe mechanic for building scar tissue:\n\n1. The agent makes a mistake. Notice it.\n2. Don't fix it yet. First, prompt the agent: *\"Add a rule to `CLAUDE.md` that prevents this exact mistake. Cite the failure mode you just hit. Make the rule specific enough to act on.\"*\n3. 
Let the agent write the rule. The model knows what kind of rule will register on its own future inference better than you do.\n4. Read the rule. Tighten it if needed. Commit it.\n5. Now fix the original mistake.\n\nAfter six months of doing this, your `CLAUDE.md` is a textbook of your codebase's hidden rules. Your agents stop making the mistakes you've already paid for.\n\n> \"I have my autonomous AI cluster updating its own `CLAUDE.md`. I honestly barely know what's in there these days.\"\n> — Hunter Hodnett, Chipp CTPO\n\n## The three-strikes-then-rule heuristic\n\nDon't add a rule for every mistake. If you do, your `CLAUDE.md` becomes 5,000 lines of one-off lessons and the truly important rules get diluted.\n\nThe heuristic we use: wait for the same class of mistake to happen *three times* before promoting it to a `CLAUDE.md` rule.\n\nOnce is an outlier. Twice is suspicious. Three times is a pattern. Patterns are what `CLAUDE.md` is for.\n\nThere are exceptions. If a single mistake costs real money, real customer trust, or real production downtime, it goes in the file the first time. But for the everyday small mistakes (the agent typed `useState` when it should have used our custom `useStableState`, or imported from `@/components/old/Button` when the new one is at `@/components/Button`), wait for the third strike.\n\nThis heuristic also keeps the file *legible*. A 200-line `CLAUDE.md` of hard-won rules is more useful than a 5,000-line `CLAUDE.md` of every observation. The agent reads the entire file every session. Make every line earn its place.\n\n## Hub-and-spoke: directory-scoped CLAUDE.md\n\nYour root `CLAUDE.md` should not contain rules that only apply to part of your codebase.\n\nIf a rule applies only to your billing system, it shouldn't load when the agent is editing your CSS. 
Loading it anyway costs context budget the agent could be using to think.\n\nThe fix is a feature most people don't know exists: **Claude Code automatically loads any `CLAUDE.md` it finds in or above the directory of any file it reads.**\n\nThat means you can put a `CLAUDE.md` at any level of your file tree, and it will load *only* when the agent is working in that area:\n\n- `src/db/CLAUDE.md`, loads when the agent reads any file in `src/db/`\n- `src/api/auth/CLAUDE.md`, loads when the agent is in the auth subsystem\n- `src/components/CLAUDE.md`, loads when the agent is in any component\n\nWe have something like fifteen `CLAUDE.md` files sprinkled throughout the Chipp codebase. Each one is small, usually 20–80 lines, and *very* specific to its directory.\n\nThe result is that the agent always has *exactly* the context it needs and very little it doesn't. Hub-and-spoke is the difference between an agent that runs out of context budget on every task and one that finishes with 70% to spare.\n\nA practical layout looks like:\n\n```\n.\n├── CLAUDE.md                          # root: tech stack, conventions, top-level rules\n├── src\n│   ├── api\n│   │   ├── CLAUDE.md                  # API conventions, error handling, route patterns\n│   │   ├── auth\n│   │   │   └── CLAUDE.md              # auth-specific gotchas\n│   │   └── billing\n│   │       └── CLAUDE.md              # billing invariants, Stripe quirks\n│   ├── db\n│   │   └── CLAUDE.md                  # ORM rules, migration patterns, schema gotchas\n│   ├── components\n│   │   └── CLAUDE.md                  # design system rules, prop conventions\n│   └── services\n│       └── CLAUDE.md                  # service layer conventions\n└── tests\n    └── CLAUDE.md                      # test framework, fixture patterns\n```\n\nEach subdirectory `CLAUDE.md` should answer the question *\"What would I tell a senior engineer who's never worked in this directory before, on day one?\"*, and nothing more. 
If a rule is universal, push it up to the root. If it's universal *within a subdirectory*, push it down. Hub-and-spoke is a structure for keeping rules at the right level.\n\n## The auto-load table\n\nHub-and-spoke handles location-based context. But sometimes you want context to load based on *what the agent is doing*, not *where in the codebase it is*.\n\nFor example: if a ticket mentions \"billing,\" the agent should load our billing playbook before it starts work, even if it doesn't yet know which files it's going to touch. Static `CLAUDE.md` placement can't handle that, because the relevant docs aren't *in* a billing directory; they're in `/docs/billing.md`.\n\nWe solve this with an auto-load table. At the top of our root `CLAUDE.md`, we have a small markdown table:\n\n```markdown\n## Auto-load table\n\n| Mention | Read |\n|---|---|\n| billing, stripe, payment, subscription | docs/billing.md |\n| auth, login, session, oauth | docs/auth.md |\n| websocket, realtime, streaming | docs/realtime.md |\n| voice, livekit, transfer | docs/voice-agents.md |\n| migration, schema, kysely | docs/db-migrations.md |\n| feature flag, rollout, kill switch | docs/feature-flags.md |\n```\n\nThe pattern: when a prompt mentions any of these keywords, the agent reads the corresponding doc into context before starting work.\n\nWe don't load the docs in `CLAUDE.md` itself, that would burn the budget on every session, even sessions that don't need them. We load them dynamically, only when relevant.\n\nThis is how I keep my root `CLAUDE.md` lean (around 25k tokens) while still giving the agent rich context for any specific subsystem (5k–15k of additional context, only when needed).\n\nThe auto-load table is the part of the architecture that keeps the system *learning*. 
Every time a successful autonomous run produces a useful insight about a subsystem, that insight goes into a markdown file in `/docs/`, and a row in the auto-load table makes it available for future runs that touch the same area.\n\n## Sub-agent definitions\n\nYour `CLAUDE.md` should also define your sub-agents.\n\nA [sub-agent](/blog/skills-vs-sub-agents) is a separate Claude Code session your main agent can spawn to do bounded work in its own context window. They live in `.claude/agents/` as their own markdown files, and your root `CLAUDE.md` should reference them.\n\nOur root `CLAUDE.md` has a section like:\n\n```markdown\n## Sub-agents\n\nWhen the work calls for a deep investigation that would otherwise burn the\nmain context window, dispatch one of these sub-agents instead:\n\n- `infra-ops` — for Kubernetes, deploys, networking, log investigation.\n  Has the runbooks. Has the kubectl tools. Returns one-paragraph summaries.\n- `feature-deep-dive` — for understanding how an existing feature works\n  before modifying it. Returns an architecture summary.\n- `feature-dependency-mapper` — for refactor planning. Returns the call\n  graph and files affected.\n- `db-investigator` — for debugging data issues. Has read access to the\n  production database with safe-column allowlisting. Returns the\n  smoking-gun query results.\n\nWhen in doubt, prefer a sub-agent over filling the main context. The main\ncontext belongs to the work, not the investigation.\n```\n\nThe agent reads this in every session and knows when to dispatch what. We rely on it heavily, most of our autonomous tickets dispatch at least one sub-agent in the research phase.\n\n## Anti-patterns that wreck a `CLAUDE.md`\n\nSix failure modes I've seen on real teams. Avoid them.\n\n### The aspiration document\n\nRules like *\"Always write clean code\"* and *\"Prefer composition over inheritance\"*. These are too vague to act on. The agent acknowledges them and ignores them. 
Replace with specific scar-tissue rules.\n\n### The bloat\n\nA 5,000-line `CLAUDE.md` that includes everything anyone ever thought was important. Every session pays for the bloat in context budget. The truly important rules get diluted into noise. Apply three-strikes-then-rule. Push directory-specific stuff into hub-and-spoke files.\n\n### The freeze\n\nA `CLAUDE.md` that hasn't been edited in three months. Either your codebase has stopped evolving (it hasn't) or your agent is making the same mistakes over and over and you're not bothering to write them down. The latter is the diagnosis 95% of the time.\n\n### The wishlist\n\nRules describing what you *wish* the codebase were like, not what it actually is. The agent reads these, dutifully writes code that matches the wish, and breaks the codebase. Document the codebase as it exists. Migrate the codebase first, then update the rule.\n\n### The TODO ledger\n\n`CLAUDE.md` used as a personal scratch pad: *\"TODO: refactor the user service. NOTE: ask Hunter about the billing thing.\"* This pollutes the system prompt with irrelevant content. Use a tasks file or your issue tracker for this. `CLAUDE.md` is for rules.\n\n### The single source\n\nTrying to put everything in one file at the root. The hub-and-spoke pattern exists precisely because not everything belongs in the root. If your root `CLAUDE.md` is over 1,000 lines, you're failing to use the directory mechanism.\n\n## How to start one if you don't have one\n\nFor a fresh project: run `claude /init` to generate a starter file. Then immediately delete most of what it generates. The generated `CLAUDE.md` is too generic to be useful, it's a placeholder for the scar tissue you'll write yourself.\n\nReplace the boilerplate with three things:\n\n1. **Your tech stack.** One-line summary of what you're using. *\"Deno + Hono on the server, Svelte 5 on the client, Cloudflare Workers for delivery.\"*\n2. 
**The two or three most important rules.** Things you already know will trip the agent up. *\"We use Kysely, not Prisma. Migrations are expand-then-contract; never both at once.\"*\n3. **An empty auto-load table.** Just the header. You'll fill it in as you generate `/docs/` files.\n\nThat's the whole starter. Resist the urge to write more. The remaining content has to be earned through actual scar tissue.\n\n## How to fix one that's a mess\n\nIf you have a `CLAUDE.md` that's already 3,000 lines of garbage, the cleanup pattern is:\n\n1. **Audit for the three-strikes-then-rule violation.** Any rule that hasn't paid for itself by preventing a real bug? Cut it.\n2. **Push location-specific rules into hub-and-spoke.** A rule about your auth system goes in `src/api/auth/CLAUDE.md`. A rule about your CSS goes in `src/components/CLAUDE.md`. The root file should only have rules that apply everywhere.\n3. **Convert aspirations to scar tissue.** *\"Always validate input\"* becomes *\"Use the `assertValidUuid()` helper before any DB query against an id column. We've shipped four 500s from this exact mistake.\"*\n4. **Build the auto-load table.** Group related rules into `/docs/` files. Reference them in the auto-load table. Cut them from the root file.\n\nThis audit is itself a great task to hand to an agent. Open an interactive session, share the current `CLAUDE.md`, and ask the agent to apply the audit pattern. It'll do most of the work for you.\n\n## The compounding asset\n\nThe reason `CLAUDE.md` matters more than any other file in your codebase is that it's the only artifact whose value compounds with every successful autonomous run.\n\nEvery line of code you write decays, frameworks change, requirements shift, tests need updating. 
Every line in `CLAUDE.md` becomes more valuable, because it's preventing a class of mistake across all future sessions.\n\nAfter three years of running this discipline, your `CLAUDE.md` is the most important asset in your codebase, more valuable than any individual feature you've shipped, because it's the thing that lets you ship the *next* feature reliably.\n\nTreat it accordingly. Read it monthly. Edit it weekly. Audit it quarterly. The teams who run autonomous development at scale all have this in common: they take their `CLAUDE.md` seriously.\n\nThe teams that don't end up wondering why their agents are still making mistakes after six months. They are. The mistakes are documented in your bug tracker. They aren't documented in `CLAUDE.md`. That's the gap.\n\n**[Join the Alchemist waitlist →](/#waitlist)**\n\n---\n\nIf you want the full discipline of context engineering, read [Context Engineering: The Skill That Turns Claude Into a Production Co-Developer](/blog/context-engineering).\n\nIf you want to see how `CLAUDE.md` fits into a real autonomous pipeline, read [Building a Self-Healing Bug Bot](/blog/self-healing-bug-bot).\n\nIf you want to know when to use a skill versus a sub-agent (the next layer up from `CLAUDE.md`), read [Skills vs Sub-Agents](/blog/skills-vs-sub-agents).",
      "date_published": "2026-05-02T00:00:00.000Z",
      "authors": [
        {
          "name": "Hunter Hodnett"
        }
      ],
      "tags": [
        "claude-md",
        "claude-code",
        "claude-code-best-practices",
        "context-engineering",
        "autonomous-development"
      ]
    },
    {
      "id": "https://adaas.dev/blog/self-healing-bug-bot",
      "url": "https://adaas.dev/blog/self-healing-bug-bot",
      "title": "Building a Self-Healing Bug Bot: The Autonomous Dev System We Use at Chipp",
      "summary": "The implementation post. Five components, real bash, real Claude Code, and the system that ships 20-30 production changes a day at Chipp without a pull request in sight. Includes the harness skeleton you can copy, the MCP fleet we run, and an honest accounting of cost and failure modes.",
      "content_text": "At 3:47 AM, Bug Bot pushed a fix to production.\n\nI learned about it the next morning. The error had landed in our log drain at 3:31. The fix had shipped at 3:47. Sixteen minutes from production fire to production deploy.\n\nI was asleep through all of it.\n\nThis post is how to build the system that lets you sleep.\n\n## What Bug Bot is\n\nBug Bot is the autonomous development cluster that runs Chipp. It picks up production bugs and feature tickets from four trigger sources, runs each through a five-stage pipeline, and pushes verified code to production without human review. Eight workers run in parallel on a single workstation. We ship 20–30 production changes per day. There are no pull requests in this system.\n\nThe high-level case for it is in [The Autonomous Development Manifesto](/blog/autonomous-development). This post is the implementation. If you've been wondering whether you could build one of these for your own product, the answer is yes, and what follows is enough to start.\n\n## The architecture\n\n```\n   ┌───────────────────────────────────────────────────────┐\n   │                   TRIGGER LAYER                        │\n   │   Slack       Email       Grafana      P95 Latency    │\n   │    tag       forward      webhook         alert       │\n   └─────┬──────────┬───────────┬───────────────┬──────────┘\n         │          │           │               │\n         └──────────┴────┬──────┴───────────────┘\n                         ▼\n                  ┌──────────────┐\n                  │ TICKET QUEUE │\n                  └──────┬───────┘\n                         ▼\n            ┌──────────────────────────┐\n            │   BASH HARNESS POOL       │\n            │   (8 workers, max)        │\n            └────┬─────┬─────┬─────┬───┘\n                 │     │     │     │\n                 ▼     ▼     ▼     ▼\n            ┌────────────────────────┐\n            │  CLAUDE CODE PIPELINE  │\n            │                        
│\n            │ Phase 0: Doc retrieval │\n            │ Phase 1: Research      │\n            │ Phase 2: Implement     │\n            │ Phase 3: Review        │\n            │ Phase 4: Docs update   │\n            │ Phase 5: Push          │\n            └─────────┬──────────────┘\n                      ▼\n                  [PRODUCTION]\n                      │\n                      ▼\n        (errors loop back via trigger layer)\n```\n\nFive components. Each one is replaceable; what matters is that they all exist and they all integrate.\n\n## Component 1: The trigger layer\n\nTickets need to land in your queue. We use four sources.\n\n### Loki + Grafana for production errors\n\nSelf-hosted log aggregation. Every server-side error is logged to Loki via a structured log call:\n\n```typescript\nlog.error({\n  service: 'billing',\n  feature: 'create_customer',\n  err: error.stack,\n  user_id: hashUserId(userId),\n});\n```\n\nSensitive fields are one-way encrypted before they ever land in logs. The agent gets metadata, not secrets.\n\nA Grafana alert rule fires every five minutes:\n\n```\nWHEN errors_count_5m > 5\n  AND grouped_by_stack_trace\nTHEN webhook(POST /bug-bot/trigger)\n```\n\nThe five-minute window dedupes, if 47 instances of the same error happen, the agent gets one ticket, not 47.\n\n### Webhook server\n\nA small Hono server listens for Grafana webhooks. Its only job is to construct a Bug Bot prompt and add it to the queue:\n\n```typescript\napp.post('/bug-bot/trigger', async (c) => {\n  const alert = await c.req.json();\n  const prompt = `\nProduction error detected.\n\nService: ${alert.labels.service}\nFirst seen: ${alert.firstSeen}\nAffected users: ${alert.uniqueUsers}\nStack trace:\n${alert.stackTrace}\n\nInvestigate and fix. 
The auto-load table will pull relevant docs.\n`;\n  await ticketQueue.add({ source: 'grafana', prompt });\n  return c.json({ ok: true });\n});\n```\n\n### Slack tag\n\nA Slack listener watches our internal `#chipp-rewrite-bugs` channel for `@bug bot` mentions. The thread becomes the prompt:\n\n```typescript\nslack.event('app_mention', async ({ event }) => {\n  const thread = await slack.getThread(event.thread_ts);\n  await ticketQueue.add({\n    source: 'slack',\n    prompt: thread.messages.map(m => `${m.user}: ${m.text}`).join('\\n'),\n  });\n});\n```\n\n### Email forward\n\nI forward customer emails to a Bug Bot inbox. A Mailgun webhook converts each email into a ticket. (I dictate most of mine via Whisper Flow on my phone, yes, I voice-message my engineering team.)\n\n### P95 latency alert\n\nA separate Grafana alert rule fires if our chat-streaming P95 exceeds three seconds. Different prompt template, same queue.\n\nThe shape of the trigger layer matters less than the principle: tickets should land in your queue from anywhere a human or system might notice a problem.\n\n## Component 2: The bash harness\n\nThis is the most important component. The bash harness is what turns Claude, non-deterministic, prone to running long, occasionally trying to `git push --no-verify` its way out of a problem, into a deterministic teammate.\n\n> \"An autonomous agent without a bash harness is an intern with no manager, no deadline, and an unlimited API budget.\"\n> — Hunter Hodnett, Chipp CTPO\n\nThe harness is shell script, not Node or Python. We've considered both. Bash wins because Claude is also writing the harness, and Claude has more bash training data than any other shell language. 
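The harness needs to be readable, debuggable, and easily edited by the same agent it manages.\n\n### Skeleton\n\n```bash\n#!/usr/bin/env bash\nset -euo pipefail\n\nWORKER_ID=$1\nTICKET_FILE=$(realpath \"$2\")  # absolute path, because we cd into the worktree below\nREPO_DIR=$(pwd)\nWORKTREE_DIR=\"/tmp/bug-bot/worker-${WORKER_ID}\"\nDEV_PORT=$((5180 + WORKER_ID))\nIDLE_TIMEOUT=300    # 5 minutes without new log output\nSTAGE_TIMEOUT=7200  # 2-hour wall-clock cap per stage\nBANNED_FLAGS=\"git push --no-verify|git reset --hard|rm -rf /\"\nDEV_PID=\"\"\n\n# Always stop the dev server and remove the worktree, even when a stage fails\ncleanup() {\n  [ -n \"$DEV_PID\" ] && kill \"$DEV_PID\" 2>/dev/null || true\n  cd \"$REPO_DIR\"\n  git worktree remove \"$WORKTREE_DIR\" --force 2>/dev/null || true\n}\ntrap cleanup EXIT\n\n# Set up worktree\ngit worktree add -b \"bot/${WORKER_ID}-$(date +%s)\" \"$WORKTREE_DIR\" main\ncd \"$WORKTREE_DIR\"\n\n# Spawn dev server in background on dedicated port\nPORT=$DEV_PORT pnpm dev > \"/tmp/bug-bot/worker-${WORKER_ID}.log\" 2>&1 &\nDEV_PID=$!\n\n# Run the four Claude stages; the fifth stage (push) is plain bash below\nfor STAGE in research implement review docs; do\n  PROMPT_FILE=\"prompts/${STAGE}.md\"\n  # Keep stage logs outside the worktree so they are never committed\n  STAGE_LOG=\"/tmp/bug-bot/worker-${WORKER_ID}-${STAGE}.log\"\n  : > \"$STAGE_LOG\"\n\n  # Spawn Claude in headless mode. Redirect to the log instead of piping to tee,\n  # so that $! is the PID of timeout (which forwards signals to claude), not of tee.\n  timeout \"$STAGE_TIMEOUT\" claude -p \\\n    --dangerously-skip-permissions \\\n    --append-system-prompt \"$(cat \"$PROMPT_FILE\")\" \\\n    < \"$TICKET_FILE\" > \"$STAGE_LOG\" 2>&1 &\n  CLAUDE_PID=$!\n\n  # Banned-flag and idle watch\n  while kill -0 \"$CLAUDE_PID\" 2>/dev/null; do\n    if grep -qE \"$BANNED_FLAGS\" \"$STAGE_LOG\"; then\n      echo \"BANNED FLAG DETECTED: killing worker $WORKER_ID\"\n      kill \"$CLAUDE_PID\"\n      exit 1\n    fi\n    # Idle kill (GNU stat). Assumes the session streams output to the log as it works.\n    if (( $(date +%s) - $(stat -c %Y \"$STAGE_LOG\") > IDLE_TIMEOUT )); then\n      echo \"IDLE TIMEOUT: killing worker $WORKER_ID\"\n      kill \"$CLAUDE_PID\"\n      exit 1\n    fi\n    sleep 2\n  done\n\n  # A non-zero exit (including a wall-clock timeout) aborts the run via set -e\n  wait \"$CLAUDE_PID\"\ndone\n\n# Force final commit + push if not already done.\n# Stage first: checking the index before git add would miss unstaged work.\ngit add -A\nif ! git diff --cached --quiet; then\n  git commit -m \"[bug-bot/${WORKER_ID}] $(cat ticket-summary.txt)\"\nfi\ngit push origin HEAD:staging\n\n# Log outcome for fine-tuning (dev server and worktree cleanup run in the EXIT trap)\necho \"{\\\"worker\\\": ${WORKER_ID}, \\\"ticket\\\": \\\"$(basename \"$TICKET_FILE\")\\\", \\\"outcome\\\": \\\"clean\\\"}\" \\\n  >> /var/log/bug-bot/outcomes.jsonl\n```\n\nThis is simplified. Our production version has more error handling and outcome labeling. 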
The shape is right.\n\n### What the harness enforces that Claude can't\n\n- **Idle kill.** If Claude doesn't fire a tool call for five minutes, the session is killed. This catches the case where Claude gets stuck in a \"let me think about this\" loop.\n- **Banned-flag grep.** If Claude attempts `git push --no-verify`, `git reset --hard`, or `rm -rf` against an absolute path, the session is aborted.\n- **Forced commit + push.** Claude occasionally completes work but forgets the final push. The harness checks the worktree state and forces it.\n- **Worktree cleanup.** Each run is isolated; nothing leaks between workers.\n- **Port allocation.** Each worker gets a dedicated dev server port (5180 + worker ID).\n- **Outcome logging.** Every run writes a JSONL row to a fine-tuning archive. (More on this below.)\n\n## Component 3: The five-stage Claude pipeline\n\nEach stage is its own Claude Code session, with its own context window. The stages communicate via files written to disk.\n\n### Phase 0: Doc retrieval (bash, not Claude)\n\nBefore any Claude session runs, a bash script semantic-searches `/docs/` for files relevant to the ticket and writes the results to `pre-context.md`:\n\n```bash\ndocs-search \"$(cat ticket.txt)\" > pre-context.md\n```\n\n`docs-search` is a small CLI we wrote that runs OpenAI's embeddings API over our `/docs/` folder once per week and stores vectors in a local SQLite file. Could be any vector store. The point is to load relevant context before Claude opens its first context window.\n\n### Phase 1: Research\n\n```\nYou are the research agent for an autonomous dev pipeline.\n\nRead the ticket. Read pre-context.md. Read relevant code.\nQuery Loki for similar errors. Query the database if useful.\nForm a hypothesis.\n\nOutput a plan.md with:\n- Root cause\n- Affected files\n- Implementation steps\n- Test strategy\n- Risks\n\nDO NOT edit any source files in this phase.\n```\n\nOutput: `plan.md`. 
Context window can fill up to 1M tokens of investigation; only the plan survives.\n\n### Phase 2: Implement\n\n```\nYou are the implement agent.\n\nRead plan.md. Read pre-context.md. That's your context.\nMake the code changes described in plan.md.\nRun unit tests for affected files.\nRun the full test suite.\nSpin up dev server on port ${DEV_PORT}.\nOpen browser MCP. Navigate to affected URLs.\nRead browser console + dev server logs.\nFix anything broken.\n\nCommit your changes when verified.\n```\n\nFresh context window. The agent never sees the original investigation, only the distilled plan.\n\n### Phase 3: Review\n\n```\nYou are the review agent.\n\nRead the diff. Red-team it.\nLook for: edge cases, security issues, type errors, broken contracts.\nYou can edit. If you make more than 5 edits, the implement agent's work is flagged messy.\n\nOutput: approved | needs-rework, plus reasoning.\n```\n\n### Phase 4: Docs update\n\n```\nYou are the docs agent.\n\nGiven the diff, identify any non-obvious behavior introduced.\nWrite or update markdown files in /docs/ to capture it.\nIdentify any docs the change has invalidated. Prune them.\nUpdate the auto-load table at the top of CLAUDE.md if needed.\n```\n\nThis is how the system gets smarter over time.\n\n### Phase 5: Push\n\nBash, not Claude. Final commit, push to staging branch, monitor deploy.\n\n## Component 4: The MCP fleet\n\nWithout MCP, your agent can read code and reason. With MCP, it can verify, query, and act. The four MCPs every Bug Bot setup needs:\n\n### Browser MCP (custom, dev-tools protocol)\n\nThis is the single most important MCP in autonomous development. Without it, you're guessing.\n\nOur browser MCP wraps a local Chromium instance via the dev-tools protocol. 
It exposes:\n\n- `browser_navigate(url)`, go to a page\n- `browser_screenshot()`, return a base64 image\n- `browser_console_logs()`, return recent console messages\n- `browser_click(selector)`, interact with the page\n- `browser_dev_login(role)`, bypass our auth flow with seeded test credentials\n\nThat last tool is the differentiator. Off-the-shelf browser MCPs are generic. The MCP we run for Chipp knows how to log in as a free user, an enterprise user, or a paying user with exhausted credits, without going through the human OAuth flow. That domain knowledge is what makes verification fast.\n\n### Log-drain MCP (custom)\n\nWraps Loki. Exposes:\n\n- `loki_query(labels, time_range)`, run a LogQL query\n- `loki_user_breadcrumbs(user_id, time_range)`, pull a user's recent interactions before the error fired\n\nThe user breadcrumbs tool is what lets the agent reconstruct the user journey that led to a bug, and propose fixes that match real usage, not synthetic edge cases.\n\n### Database MCP (custom)\n\nWraps our database with hard-coded safe column lists. We give the autonomous agents read access to production. The MCP enforces:\n\n- No `SELECT *`. The MCP returns only the columns you've explicitly allowed.\n- Sensitive columns (passwords, OAuth tokens, payment methods) are filtered out at the MCP layer; the agent never sees them in any session.\n- All queries are read-only by default. We have a write-enabled variant gated behind an additional bash-harness check.\n\nWe tried off-the-shelf database MCPs first. They hallucinated column names constantly. Custom won.\n\n### File system + bash (built-in)\n\nClaude Code includes file system and bash tools by default. You don't need to install these. You do need to ensure your `CLAUDE.md` documents which paths are off-limits and which commands are dangerous.\n\n## Component 5: The verification loop\n\nThe browser MCP is the loop. Here's the actual sequence each implement agent runs after writing code:\n\n1. 
Code changes saved in worktree.\n2. Worktree's dev server (already running on dedicated port) auto-reloads.\n3. Agent calls `browser_navigate('localhost:5184/affected-page')`.\n4. Agent calls `browser_screenshot()`. Reads the image.\n5. Agent calls `browser_console_logs()`. Reads the console output.\n6. If no errors, the agent calls `browser_click('#confirm')` to interact with the changed UI.\n7. Repeat screenshot + logs read.\n8. If errors detected, the agent forms a hypothesis, edits the code, and the loop starts over.\n\nThe loop is what separates autonomous development from vibe coding. Vibe coding ends with the diff. Autonomous development ends with verified production code.\n\n> \"Claude writing code without verification is a liability. Claude writing code and verifying and pushing to prod is a teammate with commit access.\"\n> — Hunter Hodnett, Chipp CTPO\n\n## Outcome logging for fine-tuning\n\nEvery Bug Bot run writes a JSONL row to a long-term archive:\n\n```json\n{\n  \"ticket_id\": \"billing-create-customer-null-pmt\",\n  \"trigger_source\": \"grafana\",\n  \"started\": \"2026-04-15T03:31:18Z\",\n  \"finished\": \"2026-04-15T03:47:02Z\",\n  \"stages\": {\n    \"research\": { \"tokens\": 412053, \"tool_calls\": 38 },\n    \"implement\": { \"tokens\": 187234, \"tool_calls\": 23 },\n    \"review\": { \"tokens\": 91482, \"tool_calls\": 12, \"edits\": 1 },\n    \"docs\": { \"tokens\": 43210, \"tool_calls\": 4 },\n    \"push\": { \"tokens\": 0, \"tool_calls\": 0 }\n  },\n  \"outcome\": \"clean\",\n  \"regressions_detected_24h\": false\n}\n```\n\nThe `outcome` field is the label. `clean` means: review made ≤5 edits, all tests passed first try, no regressions detected within 24 hours of deploy. `messy` means anything else.\n\nThis data is gold. Every successful autonomous run produces a labeled training row showing how a frontier model approached a real engineering task. 
Builders who treat their pipeline outputs as a strategic data asset, instead of throwing them away after each run, end up with the training data to fine-tune cheaper specialized models on their own codebase. That's a moat. We'll cover the mechanics of it in a future post.\n\n## The cost reality\n\nBug Bot is not free. Each ticket runs through five Claude Code sessions, each with substantial context. Order of magnitude: low double-digit dollars per ticket on a frontier model, at current pricing.\n\nThat sounds expensive until you compare it to the alternative. A single Bug Bot ticket replaces roughly a junior engineer's day of work: read the stack trace, find the bad commit, write the fix, test it, ship it. The cluster runs all day, all night, with no benefits package.\n\nWe get roughly a 10–50x cost advantage versus traditional engineering labor for the kind of work Bug Bot does best (fixing bugs in well-documented code paths, building features within an established architecture). For more open-ended work (designing new systems, debugging hardware integrations, reasoning about edge cases that aren't represented in our training data), the cost advantage compresses, sometimes to break-even.\n\nThe honest truth: Bug Bot succeeds on the first try about 70–80% of the time. The other 20–30% require a re-prompt, often because we didn't include enough context the first time. We treat those failures as scar tissue. Almost every re-prompt becomes a doc, a `CLAUDE.md` rule, or an auto-load table entry that prevents the same failure next time.\n\n## When this fails (and how we fix it)\n\nFailure modes worth knowing about before you start:\n\n### Cross-tool integrations\n\nAnything outside your codebase is high risk. Bug Bot is great at fixing bugs in our own code. It's worse at debugging issues with a Stripe API change, a LiveKit voice agent update, or any third-party service whose behavior the agent can't directly observe.\n\nThe fix is custom MCPs. 
We built a Stripe MCP that wraps Stripe's API in tools the agent can call directly. Same for LiveKit. The pattern: any external dependency that breaks Bug Bot's success rate gets its own MCP server.\n\n### Decomposition failures\n\nBug Bot is designed for tasks that fit in one pipeline run. *\"Fix this billing bug\"* works. *\"Build a new analytics dashboard with 12 widgets\"* doesn't.\n\nThe bottleneck isn't execution. It's decomposition. Large features need a human (or another autonomous layer) to break them into pipeline-sized tickets. We handle this manually for now. The next iteration of Bug Bot will include a decomposition stage that runs before the research stage.\n\n> \"Hard part is decomposition, not execution.\"\n> — Hunter Hodnett, Chipp CTPO\n\n### MCP server downtime\n\nIf your browser MCP or database MCP goes down, your agents lose their senses mid-session. We treat MCP servers as production infrastructure: monitored, alerted, deployed in pairs.\n\n### Banned-flag false positives\n\nOccasionally the harness kills a session for what looks like a banned flag in a comment or test fixture. We've tightened the regex over time. When in doubt, log the false positive and investigate; don't relax the regex pre-emptively.\n\n## What this gives you\n\nThe 3:47 AM moment becomes routine.\n\nThe on-call rotation goes empty. PagerDuty escalations stop. Senior engineers stop reviewing AI-generated PRs because there are no PRs. The PR queue empties because there's no concept of a PR in this system. Customer-reported bugs get fixed before the customer support team has finished writing the ticket.\n\nYou sleep through the night. You wake up to a Slack channel full of completed work. You spend your day on decomposition, judgment, and the kinds of architectural decisions only a human can make, because every other thing has been done by the cluster.\n\nThat is what Bug Bot gives you. 
It's also what we're productizing as Alchemist for builders who'd rather not spend nine months building it themselves.\n\n**[Join the Alchemist waitlist →](/#waitlist)**\n\n---\n\nIf you want the foundational case for autonomous development, start with [The Autonomous Development Manifesto](/blog/autonomous-development).\n\nIf you want to understand the discipline that makes this all work, the foundation underneath the harness, the pipeline, and the MCPs, read [Context Engineering: The Skill That Turns Claude Into a Production Co-Developer](/blog/context-engineering).",
      "date_published": "2026-05-01T00:00:00.000Z",
      "authors": [
        {
          "name": "Hunter Hodnett"
        }
      ],
      "tags": [
        "bug-bot",
        "autonomous-coding",
        "agentic-workflows",
        "claude-code",
        "self-healing"
      ]
    },
    {
      "id": "https://adaas.dev/blog/context-engineering",
      "url": "https://adaas.dev/blog/context-engineering",
      "title": "Context Engineering: The Skill That Turns Claude Into a Production Co-Developer",
      "summary": "Context engineering is the foundational discipline of autonomous development, and the source of most of your hallucinations, token bills, and pipeline failures. Four core moves, the patterns we run every day at Chipp, and how to know it's working.",
      "content_text": "There is a finite amount of paper your model can write on.\n\nEverything you've ever read about getting the most out of an AI agent, every tip about prompts, every technique for \"making Claude smarter\", collapses into managing what fits on that paper.\n\nThis is context engineering. It's not a mindset. It's not a philosophy. It's a set of mechanical decisions about what tokens go into the context window, what tokens get compressed, and what tokens never make it in at all.\n\nGet it right and your agent finishes its work, ships, and moves on. Get it wrong and your agent enters a doom loop of re-reading the same files, summarizing its own findings, hallucinating results, and asking you for context it should have inferred.\n\nIn our [autonomous development cluster](/blog/autonomous-development) at Chipp, context engineering is the difference between an agent that ships 25 production changes a day and one that fills a context window with 90 tool calls of confusion before timing out. We've made the mistakes. This post is the rules I wish I'd known a year ago.\n\n## What's actually in your context window\n\nEvery Claude Code session has six things competing for the same paper:\n\n1. **The system prompt.** Your `CLAUDE.md`, plus any subdirectory `CLAUDE.md`s loaded by the hub-and-spoke pattern.\n2. **The tool definitions.** Every MCP server adds its tool schemas. The descriptions are part of every prompt.\n3. **The conversation history.** Every prior user message, assistant message, and tool call result since the session started.\n4. **The tool results.** File contents you read. Grep results. Database query rows. Browser screenshots. All of these get serialized into context.\n5. **The reasoning scratch space.** Where the model thinks. Internal chain-of-thought tokens that don't show up in the final output.\n6. **The current user message and the pending response.**\n\nYou don't get to choose how the model allocates between these. 
You get to choose what's available to fill them.\n\nA typical Claude Sonnet/Opus session has 1M tokens of context. Your `CLAUDE.md` should fit in under 25k of that, about 5% of the budget. Tool definitions for a typical autonomous setup will run another 20–50k. That leaves roughly 900k for actual work.\n\nThe math sounds generous until you read three files of 1,500 lines each, run a few grep commands, and take two browser screenshots. You're 200k tokens into your remaining budget before the agent has done anything productive.\n\n## The compaction trap\n\nWhen the context window fills, the model doesn't gracefully degrade. It hits **compaction**: a smaller, cheaper model summarizes the entire conversation into a paragraph and replaces the original tokens. The session continues, but the model wakes up with vague recollections instead of specifics.\n\nWhat you lose in a compaction:\n\n- The exact file contents you read.\n- The specific stack traces and error messages from earlier in the session.\n- The reasoning chain that led to the current state.\n- Specific line numbers, variable names, and database row values.\n\nWhat survives:\n\n- The system prompt (your `CLAUDE.md`).\n- A paragraph summary of everything that was compacted.\n- The most recent few messages.\n\nA summary written by a cheap model is a worse representation of the past than the actual past. After a compaction, the agent is reasoning over a faded photocopy of its memory. This is where hallucinations come from.\n\n> \"Compaction is incredibly destructive. You really want to avoid compactions at all cost.\"\n> — Hunter Hodnett, Chipp CTPO\n\nThe corollary: your `CLAUDE.md` is your highest-leverage file. It survives compactions. Everything else is in danger.\n\n## The four core moves\n\nContext engineering, in practice, is four moves you make over and over. None of them are hard. 
The discipline is doing them consistently.\n\n### Move 1: Stabilize the system prompt for KV-cache hits\n\nAnthropic's API caches input tokens. If your system prompt is byte-identical between two requests, the cached version costs you a fraction of the original.\n\nThis sounds obvious until you realize how easy it is to bust the cache by accident. We were burning through input tokens at full price for months before we figured out our bug.\n\nOur `CLAUDE.md` had a line that injected the current date so the agent would know what day it was. We were injecting the date down to the second:\n\n```\nThe current date and time is: 2026-05-06T14:23:47.318Z\n```\n\nThat value changed on every request. The cache busted on every request. We were paying full price for a 25k-token system prompt thousands of times a day.\n\nThe fix was trivial:\n\n```\nThe current date is: 2026-05-06\n```\n\nKV cache hits jumped from near-zero to over 90%. Token spend dropped accordingly.\n\nThe general rule: anything in your system prompt that varies between calls, timestamps, request IDs, randomly-ordered lists, busts the cache. Make the system prompt stable. Inject volatile context as user messages, not as system prompt content.\n\n### Move 2: One context window per goal\n\nThe single biggest mistake teams make is trying to do too much in one session.\n\nYou start a Claude Code session. You ask it to investigate a bug. It reads ten files, runs a few grep commands, queries the database. It forms a hypothesis. You ask it to implement the fix. It writes the code, runs the tests. You ask it to review the code. It edits a few things. You ask it to update the docs. By now you're six tool calls past compaction and the agent's reasoning has gone fuzzy.\n\nThe fix: break the work into stages, and start a fresh context window for each stage.\n\nThis is what our [Bug Bot pipeline](/blog/self-healing-bug-bot) does. Five stages, research, implement, review, docs, push, and each stage is a separate session. 
The output of one stage is a markdown file, which becomes the input for the next.\n\nStage 1 fills its context window with research and outputs a `plan.md`.\nStage 2 starts fresh, reads only `plan.md`, writes the code.\nStage 3 starts fresh, reads only the diff, reviews it.\n\nNo stage ever runs out of room because no stage tries to do everything.\n\n### Move 3: Use sub-agents to dilute\n\nSome work is inherently context-heavy. Investigating a Kubernetes pod restart can require reading thousands of lines of logs, querying multiple endpoints, cross-referencing deploy histories. If you do this in your main session, you've burned your budget.\n\nThe solution: spawn a sub-agent. The main session calls the sub-agent like any tool, the sub-agent gets its own fresh 1M-token context window, it does whatever investigation it needs, and it returns a one-paragraph insight.\n\nThe first time we used this in production was for an infrastructure issue. Pods were restarting; we didn't know why. I prompted the main session: *\"Figure out why our pods are restarting.\"* It spawned an `infra-ops` sub-agent we'd configured with all our Kubernetes runbooks.\n\nThe sub-agent ran 47 `kubectl` commands. Queried Loki for recent error patterns. Cross-referenced the deploy history. Filled almost a full context window with raw evidence.\n\nThen it returned one sentence: *\"OOM after the last deploy, memory limit too low; recommend bumping the limit from 512Mi to 1Gi.\"*\n\nThat sentence, 23 tokens, was what landed in my main session. The 950k tokens of evidence stayed in the sub-agent's context, where it belonged.\n\nUse sub-agents for any work where the answer is short but the investigation is long.\n\n### Move 4: Pre-load with an auto-load table\n\nHub-and-spoke `CLAUDE.md` works for static, location-based context. But sometimes you want context to load based on *what the agent is doing*, not *where in the codebase it is*.\n\nWe built an auto-load table for this. 
At the top of our root `CLAUDE.md`, we have a small markdown table:\n\n```markdown\n## Auto-load table\n| Mention | Read |\n|---|---|\n| billing, stripe, payment, subscription | docs/billing.md |\n| auth, login, session, oauth | docs/auth.md |\n| websocket, realtime, streaming | docs/realtime.md |\n| voice, livekit, transfer | docs/voice-agents.md |\n```\n\nThe pattern: when a prompt mentions any of these keywords, the agent reads the corresponding doc into context before starting work.\n\nWe don't load the docs in `CLAUDE.md` itself, that would burn the budget on every session, even sessions that don't need them. We load them dynamically, only when relevant.\n\nThis is how I keep my root `CLAUDE.md` lean while still giving the agent rich context for specific subsystems.\n\n> \"I have my autonomous AI cluster updating its own `CLAUDE.md`. I honestly barely know what's in there these days.\"\n> — Hunter Hodnett, Chipp CTPO\n\n## The mental model\n\nPicture the context window as a single sheet of paper, fixed font size.\n\nWhen you read a file, you've copied that file onto the paper.\nWhen you run a grep, you've copied the result.\nWhen the agent reasons, it's writing on the paper.\n\nRun out of room and the paper gets folded. A cheap intern reads everything you wrote and replaces it with a paragraph summary on a fresh sheet. You keep working but you've lost the details.\n\nThe discipline of context engineering is engineering what gets written on the paper before it runs out, and never letting the cheap intern get involved.\n\n## Patterns we use every day\n\nBeyond the four core moves, here are the patterns that show up most in our daily work.\n\n### Fresh-context handoff via markdown\n\nPipeline stages communicate by writing markdown files to disk. Stage 1's last action is `Write plan.md`. Stage 2's first action is `Read plan.md`. 
Stage 1's context window is gone forever, but the distilled insight survives.\n\nThis is the same pattern as the sub-agent dilution move, applied to sequential work.\n\n### Three-strikes-then-rule for `CLAUDE.md`\n\nDon't add a rule to `CLAUDE.md` after a single mistake. Wait for the same class of mistake to happen three times. Otherwise your `CLAUDE.md` bloats with one-off lessons that never recur, and the truly important rules get diluted.\n\nThree strikes is a heuristic, not a hard rule. The point is to be conservative about what gets the elevated status of \"every-session context.\"\n\n### Hub-and-spoke directory loading\n\nPlace a `CLAUDE.md` in any subdirectory where the rules differ from the root. Claude Code automatically reads the nearest `CLAUDE.md` when it reads a file in that directory.\n\nWe have `CLAUDE.md` files in:\n\n- `src/db/`: ORM-specific rules (we use Kysely, not Drizzle; never let the agent forget)\n- `src/api/`: API conventions (Hono routing, error-handling patterns)\n- `src/components/`: design system rules (CSS variables only, never hex codes)\n- `tests/`: test framework conventions\n\nThe agent loads the right one without me having to tell it.\n\n### Kill the summary mid-stream\n\nWhen you see Claude write something like *\"I've now read several files. Let me summarize what I learned…\"* in the middle of an interactive session, stop it. That summary is about to land in your context as the canonical record of what the agent did. You want the *evidence*, not a pre-compaction.\n\nTell it: *\"Don't summarize. I want to see the actual results.\"*\n\nThis matters less in autonomous pipeline runs (you're not watching those), but the underlying principle is general: prefer raw artifacts over the agent's interpretation of artifacts.\n\n### Use the most powerful model, every time\n\nWhen teams ask me how to save money on token spend, the first thing I say is: don't.\n\nUse the most expensive model. Always. 
Even when it feels wasteful.\n\nThe reason is that frontier models hallucinate less, plan better, and finish work in fewer total tokens. A cheaper model in an autonomous setting will burn more total tokens chasing its own mistakes than a frontier model would have spent doing the work right the first time.\n\nThis is even more true in the early weeks after a model release. Frontier labs subsidize new models, they serve the highest-quality version at launch and gradually quantize them down to cheaper-to-serve versions over the following weeks. If you're going to do hard work with an autonomous agent, do it in the first weeks after a release, when the model is at its sharpest.\n\n### Write your context-engineering scars into auto-load docs\n\nWhen you encounter a context-engineering failure, say, the agent kept reading the wrong file because it didn't have enough context about a subsystem, don't put the lesson in your root `CLAUDE.md`. Write a doc into `/docs/`, add a row to the auto-load table, and move on.\n\nAuto-loaded docs are scoped to relevance. Root `CLAUDE.md` is global. Match the scope of the lesson to the scope of the file.\n\n### Capture every run's data\n\nEvery Claude Code session you run produces a record of how a frontier model reasoned about a real problem in your codebase. That's training data, the kind that, six months from now, you might want to fine-tune a cheaper specialized model on. The builders who treat their pipeline outputs as a strategic data asset, instead of throwing them away after each run, will end up with the only kind of moat that compounds in this industry.\n\nEven if you never do anything with the data, archive it. Storage is cheap. 
Past inference is irreplaceable.\n\n## How you know it's working\n\nYou know context engineering is working when:\n\n- The agent finishes work without compacting.\n- Re-running similar tasks gives consistent quality.\n- You can launch sessions, walk away, and come back to shipped code.\n- The agent stops asking you for context it should have inferred.\n- Your `CLAUDE.md` grows by a line or two per week, not per day.\n\nYou know it's not working when:\n\n- The agent hits compaction in the middle of routine tasks.\n- You see the same hallucinations repeatedly (this is your scar tissue not yet hardened into rules).\n- The agent reads the same files in every session because there's no doc layer to load them once and remember.\n- Token spend per task is going up, not down.\n\nIf you're in the second category: start with one core move at a time. Get the system prompt stable for KV cache hits. Then split your sessions one-context-per-goal. Then add sub-agents for any investigation that fills the budget. Auto-load tables are the polish; the moves above are the foundation.\n\n## What's next\n\nContext engineering is the foundational discipline of autonomous development, but it's only the first layer. 
Layered on top of it:\n\n- **[CLAUDE.md Architecture](/blog/claude-md-architecture)**: the hub-and-spoke pattern, scar tissue practice, and auto-load tables in detail.\n- **[Skills vs Sub-Agents](/blog/skills-vs-sub-agents)**: when to use which, and why the distinction matters.\n- **[MCP Is the USB-C of AI](/blog/building-your-first-mcp-server)**: building the senses that let your agent verify its own work.\n- **[Building a Self-Healing Bug Bot](/blog/self-healing-bug-bot)**: context engineering applied end-to-end in a production pipeline.\n\nIf you want to see what context engineering enables in production, start with [The Autonomous Development Manifesto](/blog/autonomous-development).\n\nIf you want to see what it looks like when it goes wrong, run an interactive Claude Code session for an hour without thinking about any of this. Watch your token spend. Watch the agent compact. Watch it hallucinate. That's the baseline. Everything above is what we do to escape it.\n\n**[Join the Alchemist waitlist →](/#waitlist)**",
      "date_published": "2026-04-30T00:00:00.000Z",
      "authors": [
        {
          "name": "Hunter Hodnett"
        }
      ],
      "tags": [
        "context-engineering",
        "claude-code",
        "claude-code-best-practices",
        "autonomous-development"
      ]
    },
    {
      "id": "https://adaas.dev/blog/autonomous-development",
      "url": "https://adaas.dev/blog/autonomous-development",
      "title": "The Autonomous Development Manifesto: Why We're Building Alchemist",
      "summary": "An engineering team of two. 20–30 production deploys a day. No PR queue, no on-call rotation. This is the case for autonomous development, the discipline, the architecture, the maturity curve, and ADaaS as a category. Plus the system we run at Chipp and the product we're building from it.",
      "content_text": "Software development is changing. Again.\n\nFor fifty years, software was written by humans typing code one keystroke at a time. For the last five, by humans typing code while a model whispered the next token. We told ourselves the model was an assistant, that the human was still the author.\n\nWe were wrong about how much of the job could be automated. Not just the typing. The reading. The reasoning. The verification. The shipping.\n\nI run an engineering team of two people. We push 20 to 30 changes to production every day. The pull request queue is empty because there are no pull requests. The on-call rotation is empty because there is no on-call rotation. I have slept through the night for the last two months.\n\nThis isn't a vibe-coding demo. This isn't \"AI is going to change everything someday.\" This is a description of the system that built and runs Chipp, the platform behind 105,000+ AI agents reaching 89M+ end users. The system we call Bug Bot. The system that, over the next year, we are productizing as **Alchemist**, and offering as **autonomous development as a service**.\n\nThis post is for the engineers who suspect this is real but haven't seen it for themselves.\n\nIt's also for the engineers who think they're better than AI. You are. For now. The window in which that's true is closing faster than most people realize.\n\nLet me show you what's on the other side.\n\n---\n\n## Part 1: What \"autonomous development\" actually means\n\nThe terms are getting thrown around and they're not all the same thing.\n\n**Autocomplete.** A model finishes your line. You're still the author.\n\n**AI pair programmer.** A model writes a function. You read it, edit it, commit it. Copilot's original pitch.\n\n**Vibe coding.** A model writes most of the code. You review the diff, occasionally course-correct, ship the result. 
The human is still in the loop on every step.\n\n**Agentic coding.** The model executes tools, runs commands, reads files, calls APIs, to accomplish a goal. The human still drives the session.\n\n**Autonomous development.** The model is given a goal. It selects its own steps, executes them, verifies the results, and ships. There is no human in the loop on individual steps. There may not be a human in the loop on the *outcome* either, if you set it up that way.\n\nThe distinction that matters most: **what happens when the model writes wrong code?**\n\nIn vibe coding, you catch it in the diff and tell the model to try again.\n\nIn autonomous development, the model catches it itself, by spinning up a dev server, opening a browser, navigating to the changed page, taking a screenshot, reading the console logs, noticing the error, and fixing it.\n\nThat last sentence is the entire game. Everything else in this post is mechanical detail about how to get there.\n\n> \"Claude writing code without verification is a liability. Claude writing code and verifying and pushing to prod is a teammate with commit access.\"\n> — Hunter Hodnett, Chipp CTPO\n\n---\n\n## Part 2: The maturity curve\n\nWe're somewhere in the middle of a five-stage industry shift. Where you sit on this curve will determine whether your business compounds or capsizes over the next twenty-four months.\n\n**Stage 1. Acceleration.** Faster typing. Slightly fewer keystrokes per shipped feature.\n\n**Stage 2. Augmentation.** Models write the boring code. Humans write the novel code and review everything. The senior engineer still does most of the thinking.\n\n**Stage 3. Vibe coding.** Models do most of the typing. Humans become editors and product designers in the loop. Output velocity 5–10x. Demos are great. Production code is hit or miss. Most builders are stuck here today.\n\n**Stage 4. Agentic coding.** Models can run tools, read files, search the web, query databases, open browsers. 
With the right setup, they can verify their own work. The human still drives the session. Output velocity 20–50x for a session, but each session needs a human to launch and oversee.\n\n**Stage 5. Autonomous development.** Models run unattended. Multiple agents in parallel. Goal-directed. Self-verifying. The human role is decomposition (turning intent into tickets) and judgment (reviewing outcomes). Output velocity is no longer the right metric, *organizational capacity* is. This is what we run. This is what Alchemist productizes.\n\nEach stage doesn't replace the prior one. It absorbs it. Inside an autonomous development cluster, you'll find vibe-coding sessions, agentic coding loops, and AI pair programming. They're all in the box. The cluster is what wraps them in a deterministic harness.\n\nIf your team is at Stage 3 and your competitor is at Stage 5, your competitor isn't 10% faster. They're operating in a different category of business. Their feature velocity isn't bottlenecked by engineer count. Yours is.\n\n---\n\n## Part 3: ADaaS (Autonomous Development as a Service)\n\nHere's the awkward part. Most teams should not build their own autonomous development cluster from scratch.\n\nWe did, because we had to. Three years of full-time work. Dozens of mistakes, including a few production database drops along the way. An immense amount of inference burned figuring out which prompts, harnesses, and architectures actually hold up under autonomous load.\n\nThe hard parts aren't where you'd expect them. They're not in the model. The model is fine. 
They're in:\n\n- **The bash harness**: the deterministic wrapper that prevents your non-deterministic agents from running away with your token budget.\n- **The verification loop**: getting the agent to actually look at the running software and see what it built.\n- **The context engineering discipline**: knowing what to load, what to prune, what to summarize, what to never compact.\n- **The CLAUDE.md scar-tissue practice**: turning every agent mistake into a rule that prevents that mistake forever.\n- **The pipeline decomposition**: splitting one task across multiple context windows so no single window has to hold everything.\n\nYou will spend nine months learning these the hard way. We already did. So we are productizing it.\n\nThat's what **ADaaS** (autonomous development as a service) means. You don't rent a tool. You don't rent a model. You rent a *process*: a configured cluster, a hardened harness, a verification loop, a maintained set of MCP servers, a documentation auto-load system, the whole apparatus that lets you point at a goal and get shipped software.\n\nThe lineage:\n\n- **SaaS** productized software.\n- **PaaS** productized application platforms.\n- **IaaS** productized infrastructure.\n- **ADaaS** productizes the engineering practice itself.\n\nWe coined the term because nobody else had. Don't be surprised when the rest of the industry catches up to it within a year; the underlying economic logic is too powerful to stay underground.\n\n---\n\n## Part 4: Inside the autonomous dev cluster\n\nLet me show you what's running on my desk right now.\n\nI have eight Claude Code workers running in parallel on a single workstation. Each worker is an instance of `claude -p`, Claude Code's headless mode, wrapped in a bash script that monitors its output, kills it if it stalls, and forces a commit at the end.\n\nTickets enter the system through four trigger sources:\n\n1. **Slack tag.** Anyone on my team can `@bug bot` in our internal channel. 
The mention payload becomes the prompt.\n2. **Email forward.** I forward customer emails to a Bug Bot inbox. The email body becomes the prompt. I dictate most of mine via Whisper Flow on my phone, yes, I voice-message my engineering team.\n3. **Grafana webhook.** Every five minutes, our log ingestion pipeline groups recent production errors by stack trace, dedupes them, and fires a webhook. The error context becomes the prompt.\n4. **Performance alert rule.** When P95 chat-streaming latency exceeds three seconds, a worker spawns automatically.\n\nWhen a ticket lands, it runs through five stages. **Each stage gets its own context window.** This is the most important architectural rule in the whole system.\n\n```\n[Trigger] → [Phase 0: Doc retrieval] → [Phase 1: Research]\n         → [Phase 2: Implement]   → [Phase 3: Code review]\n         → [Phase 4: Docs update] → [Phase 5: Push to prod]\n```\n\n**Phase 0. Doc retrieval.** Before Claude even sees the ticket, a bash script semantic-searches our `/docs/` folder for any markdown files relevant to the ticket. Those docs get prepended to the research prompt as context.\n\n**Phase 1. Research.** Fresh context window. Claude reads relevant code, queries our log drain (Loki) for similar past errors, queries our production database for affected rows, and outputs a `plan.md` file describing the root cause and proposed fix. Sensitive fields in the database are one-way encrypted; the agent can pattern-match on metadata without ever seeing customer secrets.\n\n**Phase 2. Implement.** Brand new context window. Claude is handed the `plan.md` (no other context) and writes the code. Runs the unit tests. Spins up a local dev server in a git worktree on a dedicated port. Opens Chromium via the browser MCP. Navigates to the affected page. Takes a screenshot. Reads the dev server logs and the browser console logs. If anything's wrong, it fixes itself.\n\n**Phase 3. Code review.** Another fresh context window. 
This one gets the full diff and is told to red-team the implementation. It can edit. If it makes more than a handful of edits, the harness flags the original implementation as `messy`.\n\n**Phase 4. Docs update.** Yet another fresh context. Given the diff, this agent writes new documentation into the `/docs/` folder for any non-obvious behavior the change introduces, and prunes any docs the change has invalidated. This is how the system gets smarter over time. Tomorrow's tickets benefit from today's lessons.\n\n**Phase 5. Push.** Bash-script-enforced. Commits the changes, pushes to the deployment branch, monitors that the deploy succeeds. Our deploy time is under 120 seconds. Production is updated before a human would have finished reading the ticket.\n\nThere is no pull request. There is no human review. There is no staging environment that a human inspects. This is on purpose.\n\n> \"If you halfway these systems, you're going to turn some poor senior engineer's entire job into reviewing AI-generated PRs. You have to either YOLO the full way or just not at all.\"\n> — Hunter Hodnett, Chipp CTPO\n\n---\n\n## Part 5: The five pillars\n\nInside the cluster, five disciplines do the heavy lifting. Each of them has a dedicated post on this blog. The short version is here.\n\n### 1. Context engineering\n\nThe model has a finite paper. Everything you put on it, the system prompt, the tool definitions, the file contents you read, the search results, the past conversation, eats space. Run out of space and the system starts compacting (a small model summarizes the context window into a paragraph and replaces the original). Compaction is amnesia. Compaction is where good agents go to hallucinate.\n\nThe discipline is engineering what goes into the context window so the agent can finish its job before it runs out of paper. → *[Context Engineering: The Skill That Turns Claude Into a Production Co-Developer](/blog/context-engineering)*\n\n### 2. 
CLAUDE.md as scar tissue\n\nYour `CLAUDE.md` is the system prompt that loads in every session. Treat it as a scar tissue document, not an aspiration document. Every line in it is there because you got bitten by a real bug.\n\nWhen you see the agent make the same mistake three times, stop the session, tell Claude to add a rule to `CLAUDE.md`, and continue. After six months, your `CLAUDE.md` is a textbook of your codebase's hidden rules, and your agents stop making those mistakes forever.\n\nUse the hub-and-spoke model: a root `CLAUDE.md` plus subdirectory `CLAUDE.md`s that auto-load when files in those directories are read. → *[CLAUDE.md Architecture: A Hub-and-Spoke Pattern for Autonomous Codebases](/blog/claude-md-architecture)*\n\n### 3. Skills and sub-agents\n\nA **skill** is a chunk of text that loads into the current context window when it's relevant. Like a cheat sheet you carry with you while you work. Useful for things you need *while* coding, design system rules, brand voice, build conventions.\n\nA **sub-agent** is an entirely separate context window the main agent can spawn. Like sending someone to the library and waiting for them to come back with a one-paragraph summary. Useful for parallelizable work or for investigations that would otherwise pollute your main context.\n\nThe decision rule: *does the agent need this knowledge while working, or can it get an answer and come back?* → *[Skills vs Sub-Agents: When to Use Each in Claude Code](/blog/skills-vs-sub-agents)*\n\n### 4. MCP: Claude's senses\n\nThe Model Context Protocol is how the agent perceives the outside world. Without MCP, the agent is reasoning over its training data. With MCP, the agent can read your production logs, query your database, navigate your dev server, take screenshots of your UI.\n\nThe single highest-leverage MCP for autonomous development is the **browser MCP**. It's what lets the agent verify its own work. 
Build a custom one tied to your application, bake your dev login flow, your seed data, your test scenarios into the tools. The off-the-shelf browser MCPs are great to learn with; they're not what you'll run in production. → *[MCP Is the USB-C of AI: Building Your First MCP Server in 30 Minutes](/blog/building-your-first-mcp-server)*\n\n### 5. The bash harness\n\n> \"An autonomous agent without a bash harness is an intern with no manager, no deadline, and an unlimited API budget.\"\n> — Hunter Hodnett, Chipp CTPO\n\nClaude is non-deterministic. Your business needs to be deterministic. The bash harness is what reconciles those two.\n\nConcretely, our harness:\n\n- Spawns Claude with `claude -p --dangerously-skip-permissions`.\n- Watches the output stream and kills the session if no tool call fires for five minutes.\n- Greps for banned flags and aborts if it sees them, `git push --no-verify`, `git reset --hard`, `rm -rf`.\n- Forces a final `git commit` and `git push` if Claude forgets.\n- Cleans up orphaned dev server processes after each run.\n- Writes outcome labels (`clean` / `messy`) for each ticket so the data is ready for fine-tuning later.\n\n→ *[Building a Self-Healing Bug Bot: The Autonomous Dev System We Use at Chipp](/blog/self-healing-bug-bot)*\n\n---\n\n## Part 6: A real ticket, end to end\n\nEnough abstraction. Here's a real production ticket from last quarter.\n\n**The error.** An HTTP 500 on `/api/billing` for customers without a default payment method. Stripe's `create_customer` call failed; we weren't catching the failure; the user got a blank page.\n\n**The trigger.** Loki ingested 47 instances within five minutes. Grafana's alert rule fired. Webhook hit Bug Bot.\n\n**The pipeline, twenty tool calls, in order:**\n\n1. Read `CLAUDE.md` from the root.\n2. Read `docs/billing.md` (auto-loaded by the doc retrieval phase because the ticket mentioned billing).\n3. Pull latest staging branch.\n4. 
Read `src/api/routes/billing/index.ts`, the file in the stack trace.\n5. `bash: rg \"create_customer\"`, find every call site.\n6. Read `src/services/stripe.ts`, the service wrapper.\n7. Query Loki for similar errors over the last seven days. Result: started 90 minutes ago.\n8. Query Loki for the deploy that landed 90 minutes ago. A change to default payment method handling.\n9. Query the production database for affected customer rows.\n10. Form hypothesis: missing null-check on `customer.default_payment_method`.\n11. Edit `src/services/stripe.ts` to add the null-check and a friendly fallback.\n12. Run unit tests for that file. Pass.\n13. Run the full unit test suite. Pass.\n14. Spin up dev server on port 5184 (worker 4).\n15. Open Chromium via browser MCP.\n16. Navigate to `localhost:5184/billing`.\n17. Click \"Buy Credits\" button.\n18. Take screenshot. Read browser console logs. No errors.\n19. Verify HTTP 402 response with friendly toast message.\n20. Commit. Push. Deploy.\n\n**The result.** A bug that would have taken a junior engineer most of a workday, read the stack trace, find the bad commit, write the fix, test it, ship it, closed in twelve minutes by a process. Cost: about a coffee.\n\nThis is what's actually running. Nothing in this section is theoretical. The ticket above is one of about 25 we ship to production in an average day.\n\n> \"Junior engineer's day done by a process in 12 minutes.\"\n> — Hunter Hodnett, Chipp CTPO\n\n### A note on anti-fragility\n\nWhile I was teaching a recent cohort session, Stripe went down for three hours. Half the AI industry's billing infrastructure went silent. The other half's customers couldn't buy.\n\nWe barely noticed. Six weeks earlier, I had pointed Bug Bot at \"build us a redundant usage-based billing system that runs alongside Stripe.\" It chewed through the work over a weekend. The system had been sitting in production, in shadow mode, ever since.\n\nWhen Stripe died, I flipped a feature flag. 
Our billing kept running. Our customers kept buying.\n\nI bring this up not as a brag but as a description of what changes when your engineering capacity stops being a function of headcount. Redundancies that would have been \"nice-to-haves with a six-month timeline\" become \"weekend tickets for an idle worker.\" Anti-fragility stops being a Talebian aspiration and starts being a default property of how you build software.\n\n---\n\n## Part 7: Why we're building Alchemist\n\nWhen Claude Opus 4 dropped, I started using it for live coding sessions. I was sitting at my desk one afternoon, watching lines of code scroll up my screen, code I hadn't typed, code that was fixing things I hadn't asked it to fix yet. My wife came down and asked me what I was actually doing for work anymore. I didn't have an answer.\n\nThat was the moment I realized the entire \"AI coding assistant\" framing was wrong. The AI wasn't an assistant. It was the engineer. I was the manager.\n\nThe system you've been reading about is what I built between that afternoon and today. Three years of mistakes compressed into six months of acceleration once the model could carry the weight.\n\nWhen we showed an early version of it to founders in our Technical Agents of Change cohort, the question kept coming back:\n\n*\"Can we have it?\"*\n\nNot \"can we license it.\" Not \"can we read the code.\" *Can we have a starter version of this whole apparatus, configured and ready, that we can run on day one?*\n\nSo we're building it. We're calling it **Alchemist**, and we're rolling it out as the first commercial implementation of autonomous development as a service.\n\nHere's what Alchemist gives you on day one:\n\n- A deployed, opinionated stack. Deno on the server, Svelte on the client, Chromium for verification, Cloudflare Workers for hosting. We picked this combination after months of testing because it has the highest LLM training-data density of any modern web stack. 
Your agents will hallucinate less and ship more in this stack than in any other.\n- A configured Claude Code cluster, harness, and verification loop. Pre-wired `CLAUDE.md`, browser MCP, log-drain MCP, and a bash harness that will not let your tokens run away.\n- A `/docs/` folder that builds itself as you ship.\n- A self-healing pipeline. Errors in production trigger autonomous fixes.\n- An integrated customer support agent that escalates real frustration to a human (us, during the alpha).\n\nAnd the most important part: **the eject mechanism.**\n\nWhen you create a project on Alchemist, we create a GitHub repo for you. Every commit Alchemist's workers make goes there. If you decide one day to take your code to a private cloud, save on hosting, or just leave, you take the repo and go. Nothing is locked behind our infrastructure. The Alchemist Deployment Workflow runs on your repo, and you can fork it. We learned a long time ago that the only way to build a sticky business is to make sure your customers are *choosing* to stay every month, not trapped into staying.\n\nThe closed alpha is starting now. If you want in, the waitlist is at the top of this page.\n\n---\n\n## Part 8: Who this is for (and who it isn't)\n\nThis is not for everyone, and I'd rather lose your attention now than later.\n\n**This is for technical builders ready to lead a fleet of agents.** People who have shipped software, who can read a stack trace, who know what a tool call is. People who are excited, not threatened, by what AI can do with a codebase.\n\n**This is for builders willing to delete their PR review process.** I'm serious. If you're attached to \"every change gets reviewed by a human before it ships,\" you're going to half-implement what we describe in this post. Half-implementations are worse than no implementation. Half-implementations turn senior engineers into AI-PR-review janitors and ship slower than the manual baseline. 
Either YOLO the full way or stay where you are.\n\n**This is for teams of one to ten trying to compete with teams of fifty to five hundred.** The economics of autonomous development bend hardest in your favor when you're capital-constrained and ambition-rich. We bootstrapped Chipp's autonomous cluster because we had two engineers and a roadmap that wanted twenty.\n\n**This is not for engineers who think AI should stay out of the way.** I respect the position. I disagree with it. Either way, this isn't the post for you.\n\n**This is not for non-technical founders looking for a no-code shortcut.** Alchemist is not Lovable. You will write `CLAUDE.md` files. You will read stack traces. You will need to know what a context window is. Our value proposition is *do less of the typing and verification yourself*, not *never touch the code*.\n\n**This is not for organizations running on quarterly review cycles.** The pace of frontier AI doesn't wait for your governance committee. If your decision-making cadence is slower than a model release cycle, you can't run an autonomous dev cluster. You can experiment with one, but you can't run one in production.\n\nIf you've made it this far and you're still nodding, you're our reader. Welcome.\n\n---\n\n## Part 9: What's coming\n\nOver the next eight weeks, we're publishing the operating manual for autonomous development. 
One piece of the cluster per post:\n\n- **Context Engineering**: managing the finite paper.\n- **CLAUDE.md Architecture**: the hub-and-spoke pattern, scar tissue practice, auto-load tables.\n- **Skills vs Sub-Agents**: when to use which.\n- **MCP**: building the senses your agents need.\n- **The Bash Harness**: turning Claude into a teammate with commit access.\n- **Multi-agent Pipelines**: serial handoffs, sub-agent dilution, self-healing loops.\n- **Agentic Design Patterns**: the seven we battle-tested at Chipp.\n- **Vibe Coding vs Autonomous Development**: the maturity curve, demystified.\n- **Beyond AI Pair Programming**: from Copilot to Coworker to Autonomous Engineer.\n\nIf you want this in your inbox, the waitlist signup at the top of this page also subscribes you to the series.\n\n---\n\n## The bottom line\n\nSoftware development is being rewritten. Three things will happen over the next twenty-four months:\n\n1. A few teams will adopt autonomous development and ship 30 features a day at a fraction of payroll cost.\n2. Most teams will halfway it, turn their senior engineers into AI PR reviewers, and end up worse off than before they started.\n3. A handful of teams will resist entirely and watch their competitors lap them.\n\nWhich group your business ends up in is being decided right now.\n\nWe're betting on group one. We're building Alchemist for the people who are.\n\nWill your engineering team ship the next decade of your product, or will an autonomous cluster ship it for you?\n\nIf the answer is the cluster, the question is whose.\n\n**[Join the Alchemist waitlist →](/#waitlist)**\n\n---\n\n*Hunter Hodnett is co-founder and CTPO of [Chipp](https://chipp.ai), the platform behind 105,000+ AI agents reaching 89M+ end users. He previously engineered at Reddit, Amazon, and The Home Depot, and trained 290+ engineers as a coding bootcamp instructor. Alchemist is Chipp's autonomous development product, currently in closed alpha. 
The waitlist is open at [adaas.dev](https://adaas.dev).*",
      "date_published": "2026-04-29T00:00:00.000Z",
      "authors": [
        {
          "name": "Hunter Hodnett"
        }
      ],
      "tags": [
        "autonomous-development",
        "adaas",
        "autonomous-software-development",
        "manifesto"
      ]
    },
    {
      "id": "https://adaas.dev/blog/what-is-adaas",
      "url": "https://adaas.dev/blog/what-is-adaas",
      "title": "What Is Autonomous Development as a Service (ADaaS)?",
      "summary": "ADaaS is a category, not a feature. It productizes the engineering practice itself, the pipeline, the harness, the verification loop, the documentation system, the whole apparatus. Here's what you're actually buying, what it isn't, how to evaluate a provider, and why this is the next -aaS to reshape an industry.",
      "content_text": "Every \"as a Service\" has done the same thing. SaaS productized software so you stopped buying CD-ROMs. PaaS productized application platforms so you stopped racking servers. IaaS productized infrastructure so you stopped buying hardware at all.\n\nEach one moved a layer of the stack from a thing you *built* to a thing you *rented*. Each one looked, to the people running the prior layer, like a strange and possibly bad idea. Each one absorbed an entire industry inside a decade.\n\nADaaS is the next one. **Autonomous development as a service** productizes the engineering practice itself, the pipeline, the harness, the verification loop, the documentation system, the whole apparatus that lets you point at a goal and get shipped software. You don't rent a tool. You don't rent a model. You rent a *process*.\n\nThis post defines the category, separates it from the things it gets confused with, and lays out a buyer's framework for evaluating ADaaS providers, including how to evaluate the one we're building.\n\n## The lineage\n\nPull the thread on every -aaS and you find the same arc.\n\n**SaaS** turned software from a thing you bought (in a box) into a thing you logged into (in a browser). The category cracked open in 1999 with Salesforce and now houses ~$300B in revenue across thousands of companies. The unit of consumption became \"a seat per month\" instead of \"a license forever.\"\n\n**PaaS** turned application infrastructure from a thing you assembled (Apache + MySQL + your own load balancer) into a thing you deployed to (Heroku, Vercel). The unit of consumption became \"an app\" instead of \"a stack.\"\n\n**IaaS** turned the hardware itself from a thing you owned (a rack in a colo) into a thing you spun up (an EC2 instance). 
The unit of consumption became \"an hour of compute\" instead of \"a server lease.\"\n\nNow look at what's actually been left out of those layers: **the engineering practice that turns intent into shipped software.** That's still a thing every company builds for itself, with hired engineers, against a backlog. It's the largest unproductized layer left in the software value chain.\n\nADaaS is what productizes that.\n\nThe unit of consumption becomes \"a ticket shipped\" instead of \"an engineer-week.\" That's a category-defining shift. If the pattern holds, and the underlying economics suggest it will, most software companies in 2030 will buy this layer instead of building it, the same way most software companies in 2010 stopped racking servers.\n\n## What you're actually buying\n\nPeople hear \"autonomous development as a service\" and assume it's an AI coding tool with a marketing team. It isn't. The category is bigger and more boring than that.\n\nWhen you buy ADaaS, you're buying:\n\n**A configured agent cluster.** Not Claude or GPT, the *cluster* on top, with the bash harness, the worker pool, the parallelism management, the timeout enforcement. The thing that turns a non-deterministic model into a deterministic teammate. ([Why this matters →](/blog/self-healing-bug-bot))\n\n**A verification loop.** A browser MCP that knows how to spin up your dev server, navigate your routes, take screenshots, read console logs. Without this, the agent's writing code and hoping. With it, the agent's writing code and *checking*.\n\n**A context-engineering apparatus.** A `CLAUDE.md` you don't have to write from scratch. An auto-load table that knows what to load when. Sub-agent definitions for the parts of your codebase that need their own attention. ([The discipline behind it →](/blog/context-engineering))\n\n**A documentation system that builds itself.** Every successful ticket writes its own docs. 
Tomorrow's tickets are smarter than yesterday's because the system never forgets a lesson it already paid to learn.\n\n**A maintained MCP fleet.** Stripe, GitHub, Cloudflare, Loki, your database, all wired up, all kept current, all hardened against the failure modes the provider has already encountered.\n\n**A self-healing pipeline.** Errors in production trigger autonomous fixes. The on-call rotation goes empty. PagerDuty escalations stop happening because the cluster is faster than the page would have been.\n\n**An eject mechanism.** This one matters. The whole stack should be portable, your code lives in your repo, your data in your buckets, your secrets in your vault. If the provider raises prices into a corner, you take your code and walk. ADaaS without an eject mechanism is platform lock-in disguised as a service.\n\nWhat you're *not* buying:\n\n- A model. The model is upstream of the cluster. You're buying the apparatus that uses the model.\n- A code generator. Code generation is one tool call inside the pipeline. The pipeline is what makes generation reliable enough to ship.\n- An IDE. ADaaS is unattended by definition. There's no human in the inner loop, so there's no IDE in the inner loop either.\n- A no-code builder. ADaaS expects you can read a stack trace. If you can't, the category isn't for you.\n\n## How ADaaS is different from what came before\n\nThe clearest way to understand a new category is by what it isn't.\n\n**ADaaS vs Cursor / Copilot.** Cursor and Copilot are *human-in-the-loop* tools. The human is at the keyboard; the AI augments the human. ADaaS removes the human from the inner loop. The human is at the *meta*-loop, defining tickets, reviewing outcomes, making judgment calls. The AI is doing the engineering.\n\n**ADaaS vs Devin / Cognition.** Devin is the closest competitor in the category, and the comparison is fair. 
The differences are stack-specific (we tune for one stack and ship a polished cluster around it; Devin works across many) and posture-specific (we ship straight to production with no PR; Devin defaults to a human review gate). Both are valid. Both are autonomous development. ADaaS as a category covers both.\n\n**ADaaS vs Lovable / Bolt / v0.** These are *vibe-coding-as-a-service* products, marketed at non-technical builders. They're great for prototypes, demos, and weekend projects. They are not autonomous development. They run a model, they show the user the diff, they let the user accept or reject. The human is in the loop on every change. ADaaS removes that loop.\n\n**ADaaS vs OpenAI's Codex / Anthropic's Claude Code.** These are the *primitives* on which ADaaS is built. They're powerful interactive tools. Hunter writes most of his daily code in Claude Code, but they require a person at the keyboard to drive each session. ADaaS is what you build *on top of* these primitives to remove the person.\n\n**ADaaS vs your in-house engineering team.** This is the comparison that matters most for buyers. An in-house team has organizational context an outside service can't have. An ADaaS provider has accumulated infrastructure your in-house team would need to rebuild. The right answer for most companies in the next 36 months is *both*, your team focuses on architecture, judgment, and the parts of the codebase that need real human eyes; the ADaaS cluster handles the volume of routine tickets that would otherwise drown a small team.\n\n> \"ADaaS removes the human from the inner loop. That's the line that separates this category from everything else.\"\n> — Hunter Hodnett, Chipp CTPO\n\n## Why this is happening now\n\nThree things had to be true at the same time for ADaaS to be a real category. They became simultaneously true in late 2025.\n\n**1. Models capable enough to ship code.** Claude Opus 4 was the inflection point. 
Below that capability bar, you can't trust the agent to write production code without supervision. Above it, you can. The bar moved up sharply with Opus 4 and has stayed up.\n\n**2. Tool surface rich enough to verify.** The Model Context Protocol gave us a standard for browser control, log queries, database access, screenshot capture. Without verification, you can't have autonomy. With it, you can.\n\n**3. Context-engineering practice mature enough to scale.** The discipline of *what to put in front of the model so it doesn't compact, hallucinate, or get lost* didn't exist as a named field two years ago. It exists now. ([The full discipline →](/blog/context-engineering))\n\nWhen all three are simultaneously true, autonomous development becomes possible. When it becomes possible at one company, the rest of the industry has 18 to 36 months before they have to either adopt or get outpaced. We're in month four.\n\n## The buyer's framework\n\nIf you're considering ADaaS for your team, here's the framework I'd run a provider through. Most don't pass. The ones that do are the ones worth paying for.\n\n### Question 1: What does your verification loop look like?\n\nIf the answer is \"we run tests after generation,\" that's vibe coding with a CI bolt-on. Pass.\n\nIf the answer is \"we open a browser, navigate to the affected page, take a screenshot, and read the console logs,\" that's autonomous development. Continue.\n\nThe verification loop is the single highest-leverage component. Without it, the cluster is guessing.\n\n### Question 2: How do you handle context overflow?\n\nIf the answer is \"we have a 1M token context, you'll be fine,\" they don't understand the problem. 
Compaction will eat them.\n\nIf the answer is \"we use a multi-stage pipeline, each stage gets a fresh context window, stages communicate via markdown files written to disk,\" they get it.\n\n### Question 3: What's your eject mechanism?\n\nIf the answer is \"you log in to our platform and use our IDE,\" you're locked in.\n\nIf the answer is \"your code is in your GitHub repo, our deployment workflow runs in your CI, you can stop paying us tomorrow and keep everything we built for you,\" they pass.\n\nThis is the deal-breaker for me. ADaaS without an eject is just SaaS lock-in with extra steps.\n\n### Question 4: Where does the system get smarter?\n\nIf the answer is \"as the model gets better,\" the provider is a passthrough. Your investment in them doesn't compound, you're paying for someone else's improvements.\n\nIf the answer is \"the documentation our cluster writes for your codebase compounds; the scar-tissue rules in your `CLAUDE.md` accumulate; the outcome-labeled training data we archive becomes the basis for fine-tuning,\" they're building you a moat.\n\n### Question 5: Show me the failure modes\n\nAny honest provider should be able to tell you exactly when their cluster fails and why. If they can't, they haven't run it long enough to know. If they can but won't, they're hiding something.\n\nOur cluster fails on cross-tool integrations (third-party APIs the agent can't directly observe), on decomposition of large features, on truly novel work that has no analog in our codebase. We say so out loud. We'd be suspicious of any provider that doesn't.\n\n## Why we coined the term\n\nYou'll notice we've been using \"ADaaS\" throughout this post like it's an established category name. It isn't. We started using it in early 2026. As of this writing, the search volume is low and the SERP is mostly dental-school applications.\n\nWe could have used \"AI coding agent platform\" or \"autonomous coding service\" or any of a dozen existing labels. 
We didn't, because none of them name the thing we're actually selling. The category needs a name. Categories that don't have names don't get bought.\n\nSo: ADaaS. Autonomous development as a service. Coined here, defined here, with the working understanding that the rest of the industry will catch up to it within 12 to 18 months because the underlying economic logic is too powerful to stay underground.\n\nIf we're wrong about the category, the term dies. If we're right, you'll be hearing it a lot.\n\n> \"Categories that don't have names don't get bought. We named ours.\"\n> — Hunter Hodnett, Chipp CTPO\n\n## Where Alchemist sits\n\nAlchemist is our implementation of the category: the configured cluster, the verification loop, the documentation system, the eject mechanism, and the maintained MCP fleet. The same system that runs Chipp's autonomous engineering ([the system I described in the manifesto](/blog/autonomous-development)) is what we're packaging.\n\nWe expect competitors. We're going to write about them when they ship; the category is bigger than any one company can capture, and the work of category-defining mostly happens in public, in posts like this one.\n\nWhat we're going to differentiate on:\n\n- **Stack opinionation.** We picked one stack (Deno, Svelte, Cloudflare) and tuned the cluster against it. Generality is a tax. Specialization is a moat.\n- **Eject by default.** Your code is in your repo from day one. We win because you choose to stay, not because you can't leave.\n- **Bundle of practice, not just product.** The cluster comes with the disciplines that make it work: `CLAUDE.md` patterns, sub-agent definitions, verification loops, and the doc system. You inherit two years of practice the moment you sign up.\n\nThe closed alpha is starting now.\n\n## The bottom line\n\nSoftware development is the largest unproductized layer left in the software value chain. ADaaS is what productizes it. 
The pattern matches every prior -aaS revolution, the technology is suddenly capable, and the disciplines for using it are no longer secret.\n\nThe teams that adopt early will build software at an engineering capacity that a team an order of magnitude larger could not match. The teams that hold out will spend the next 24 months watching that gap widen.\n\nIf you've read this and you're nodding, you're our buyer.\n\n**[Join the Alchemist waitlist →](/#waitlist)**\n\n---\n\nIf you want the long-form case for autonomous development, start with [The Autonomous Development Manifesto](/blog/autonomous-development).\n\nIf you want to see what an ADaaS implementation looks like under the hood, read [Building a Self-Healing Bug Bot](/blog/self-healing-bug-bot).",
      "date_published": "2026-04-28T00:00:00.000Z",
      "authors": [
        {
          "name": "Hunter Hodnett"
        }
      ],
      "tags": [
        "adaas",
        "autonomous-development",
        "autonomous-software-development",
        "software-as-a-service",
        "foundations"
      ]
    },
    {
      "id": "https://adaas.dev/blog/what-is-autonomous-software-development",
      "url": "https://adaas.dev/blog/what-is-autonomous-software-development",
      "title": "What is Autonomous Software Development?",
      "summary": "A working definition of autonomous software development, how it differs from coding assistants like Copilot, and where the technology actually delivers today.",
      "content_text": "**Autonomous software development** is a class of software engineering where an AI agent — not a human — owns the loop from problem statement to merged code. The human writes a description of what they want; the agent reads the codebase, edits files, runs tests, reviews its own diff, and pushes the result. No keystroke-by-keystroke supervision. No copy-paste from a chat window.\n\nThis is a meaningfully different posture from the AI coding assistants most engineers use day-to-day. It's worth pinning down what the term actually means before unpacking how it works in practice.\n\n## A working definition\n\nAn autonomous software development system has four properties:\n\n1. **A bounded task input.** Usually a GitHub issue, a Slack message, a customer support ticket, or a natural-language description. The input names a goal, not a sequence of steps.\n2. **Tool-mediated execution.** The agent works through a constrained interface — `read_file`, `write_file`, `run_tests`, `push_changes` — rather than free-form output. Tools provide accountability and a place to enforce safety.\n3. **A self-terminating loop.** The agent decides when it's done. Either it pushes a commit, or it exhausts a turn budget and reports failure. There is no human in the middle of the loop.\n4. **An auditable result.** Every decision the agent made is recoverable from the conversation log, the tool calls, and the final diff. Someone can read the trail and decide to merge, revise, or reject.\n\nIf any of those is missing, you have something else. A system that requires a human to pick the next file to edit is an assistant. A system that emits code without running it is a generator. A system without a clear stop condition is a research demo.\n\n## How it differs from coding assistants\n\nGitHub Copilot, Cursor's autocomplete, and most IDE-resident AI features operate at the keystroke or block level. They predict the next token given the current cursor position. 
The human is doing the engineering — choosing which file to open, which abstraction to introduce, when the work is done. The AI is making the typing faster.\n\nAutonomous systems invert that ratio. The human writes a paragraph; the AI does the engineering. That shift matters because the bottleneck on most software teams is not how fast people can type — it's how much *coordinated cognitive load* a single change requires. Reading the existing code. Understanding the bug. Writing the fix. Writing the test. Reviewing the diff. Pushing without breaking anything else. That whole stack is what an autonomous system is trying to absorb.\n\n## How it works in practice\n\nConcretely, every modern autonomous coding system looks roughly the same on the inside:\n\n```\nIssue → System prompt + tools → LLM tool-use loop → Push\n```\n\nThe agent runs in a sandbox — a container, an E2B microVM, sometimes a fresh Kubernetes Job per ticket. Inside the sandbox, the repo is cloned, a model like Claude or GPT-4-class is given the issue plus a tool schema, and the model decides what to do next. Each model turn either calls a tool or stops. The framework executes the tool, feeds the result back, and asks the model what's next.\n\nMost systems put a self-review step before push: the agent runs `git diff` on its own work, critiques it as a separate model call, and either fixes the issues it finds or proceeds. This single trick — letting the model see its own work before committing — is one of the largest quality wins in the space.\n\n## What it can do today\n\nAutonomous systems work well on:\n\n- **Well-scoped bug fixes.** \"Date parsing breaks when the timezone is null\" — clear input, finite blast radius, a test that proves the fix.\n- **Mechanical refactors.** Rename a symbol across a repo. Migrate a library. 
Update an API call site.\n- **Feature scaffolding.** \"Add a new resource called `Project` with the same CRUD shape as `Tenant`\" — the pattern already exists in the codebase, the agent just instantiates it.\n- **Tests written from existing behavior.** Pin down what the code does today so it can be refactored tomorrow.\n\nWhat they're not yet good at — and where the field is actively working — is anything that requires *holding the whole system in your head*. Cross-cutting performance work. Ambiguous product decisions. Architectural changes that touch every layer at once. Those still need a human.\n\n## Why this matters\n\nAutonomous software development changes the unit of engineering work from \"a person-day\" to \"a ticket\". The ticket is the smallest thing you can hand off. If a system can reliably absorb tickets, then the question stops being \"how many engineers do we have\" and starts being \"how many tickets can we describe well enough to ship\". The bottleneck moves up the stack — to specification, to testing, to code review.\n\nThis is what we're building Alchemist around. Tickets in, code out, with a transparent trail of every decision the agent made along the way. We'll be writing about the parts that turned out to be harder than they looked — sandboxing, billing per turn, the model's tendency to over-edit, what self-review actually catches — in posts that follow.\n\nIf you want to be early, [join the waitlist](/#waitlist). If you want to be very early, the engineering posts are the place to read along.",
      "date_published": "2026-04-27T00:00:00.000Z",
      "authors": [
        {
          "name": "The Alchemist team"
        }
      ],
      "tags": [
        "autonomous-software-development",
        "ai-agents",
        "software-engineering"
      ]
    }
  ]
}