# The Autonomous Development Manifesto: Why We're Building Alchemist

> An engineering team of two. 20–30 production deploys a day. No PR queue, no on-call rotation. This is the case for autonomous development: the discipline, the architecture, the maturity curve, and ADaaS as a category. Plus the system we run at Chipp and the product we're building from it.

Software development is changing. Again.

For fifty years, software was written by humans typing code one keystroke at a time. For the last five, by humans typing code while a model whispered the next token. We told ourselves the model was an assistant, that the human was still the author.

We were wrong about how much of the job could be automated. Not just the typing. The reading. The reasoning. The verification. The shipping.

I run an engineering team of two people. We push 20 to 30 changes to production every day. The pull request queue is empty because there are no pull requests. The on-call rotation is empty because there is no on-call rotation. I have slept through the night for the last two months.

This isn't a vibe-coding demo. This isn't "AI is going to change everything someday." This is a description of the system that built and runs Chipp, the platform behind 105,000+ AI agents reaching 89M+ end users. The system we call Bug Bot. The system that, over the next year, we are productizing as **Alchemist**, and offering as **autonomous development as a service**.

This post is for the engineers who suspect this is real but haven't seen it for themselves.

It's also for the engineers who think they're better than AI. You are. For now. The window in which that's true is closing faster than most people realize.

Let me show you what's on the other side.

---

## Part 1: What "autonomous development" actually means

The terms are getting thrown around and they're not all the same thing.

**Autocomplete.** A model finishes your line. You're still the author.

**AI pair programmer.** A model writes a function. You read it, edit it, commit it. Copilot's original pitch.

**Vibe coding.** A model writes most of the code. You review the diff, occasionally course-correct, ship the result. The human is still in the loop on every step.

**Agentic coding.** The model executes tools (runs commands, reads files, calls APIs) to accomplish a goal. The human still drives the session.

**Autonomous development.** The model is given a goal. It selects its own steps, executes them, verifies the results, and ships. There is no human in the loop on individual steps. There may not be a human in the loop on the *outcome* either, if you set it up that way.

The distinction that matters most: **what happens when the model writes wrong code?**

In vibe coding, you catch it in the diff and tell the model to try again.

In autonomous development, the model catches it itself, by spinning up a dev server, opening a browser, navigating to the changed page, taking a screenshot, reading the console logs, noticing the error, and fixing it.

That last sentence is the entire game. Everything else in this post is mechanical detail about how to get there.
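Reduced to its skeleton, that self-correction is a bounded check-and-fix loop. The Python sketch below is illustrative only: `collect_errors` and `attempt_fix` are hypothetical stand-ins for "read the dev server and browser console logs" and "let the agent edit the code", and the attempt cap is an assumption, not a number from our system.

```python
from typing import Callable


def verify_and_fix(
    collect_errors: Callable[[], list[str]],
    attempt_fix: Callable[[list[str]], None],
    max_attempts: int = 3,  # assumed cap so the loop cannot run forever
) -> bool:
    """Return True once a verification pass comes back with no errors."""
    for _ in range(max_attempts):
        errors = collect_errors()  # e.g. console errors, failed requests
        if not errors:
            return True
        attempt_fix(errors)  # hand the observed errors back to the agent
    # One last look after the final fix attempt.
    return not collect_errors()
```

The point of the shape is that verification produces evidence (the error list), and that evidence is the only input to the next fix attempt.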

> "Claude writing code without verification is a liability. Claude writing code and verifying and pushing to prod is a teammate with commit access."
> — Hunter Hodnett, Chipp CTPO

---

## Part 2: The maturity curve

We're somewhere in the middle of a five-stage industry shift. Where you sit on this curve will determine whether your business compounds or capsizes over the next twenty-four months.

**Stage 1. Acceleration.** Faster typing. Slightly fewer keystrokes per shipped feature.

**Stage 2. Augmentation.** Models write the boring code. Humans write the novel code and review everything. The senior engineer still does most of the thinking.

**Stage 3. Vibe coding.** Models do most of the typing. Humans become editors and product designers in the loop. Output velocity 5–10x. Demos are great. Production code is hit or miss. Most builders are stuck here today.

**Stage 4. Agentic coding.** Models can run tools, read files, search the web, query databases, open browsers. With the right setup, they can verify their own work. The human still drives the session. Output velocity 20–50x for a session, but each session needs a human to launch and oversee.

**Stage 5. Autonomous development.** Models run unattended. Multiple agents in parallel. Goal-directed. Self-verifying. The human role is decomposition (turning intent into tickets) and judgment (reviewing outcomes). Output velocity is no longer the right metric; *organizational capacity* is. This is what we run. This is what Alchemist productizes.

Each stage doesn't replace the prior one. It absorbs it. Inside an autonomous development cluster, you'll find vibe-coding sessions, agentic coding loops, and AI pair programming. They're all in the box. The cluster is what wraps them in a deterministic harness.

If your team is at Stage 3 and your competitor is at Stage 5, your competitor isn't 10% faster. They're operating in a different category of business. Their feature velocity isn't bottlenecked by engineer count. Yours is.

---

## Part 3: ADaaS (Autonomous Development as a Service)

Here's the awkward part. Most teams should not build their own autonomous development cluster from scratch.

We did, because we had to. Three years of full-time work. Dozens of mistakes, including a few production database drops along the way. An immense amount of inference burned figuring out which prompts, harnesses, and architectures actually hold up under autonomous load.

The hard parts aren't where you'd expect them. They're not in the model. The model is fine. They're in:

- **The bash harness**: the deterministic wrapper that prevents your non-deterministic agents from running away with your token budget.
- **The verification loop**: getting the agent to actually look at the running software and see what it built.
- **The context engineering discipline**: knowing what to load, what to prune, what to summarize, what to never compact.
- **The CLAUDE.md scar-tissue practice**: turning every agent mistake into a rule that prevents that mistake forever.
- **The pipeline decomposition**: splitting one task across multiple context windows so no single window has to hold everything.

You will spend nine months learning these the hard way. We already did. So we are productizing it.

That's what **ADaaS** (autonomous development as a service) means. You don't rent a tool. You don't rent a model. You rent a *process*: a configured cluster, a hardened harness, a verification loop, a maintained set of MCP servers, a documentation auto-load system, the whole apparatus that lets you point at a goal and get shipped software.

The lineage:

- **SaaS** productized software.
- **PaaS** productized application platforms.
- **IaaS** productized infrastructure.
- **ADaaS** productizes the engineering practice itself.

We coined the term because nobody else had. Don't be surprised when the rest of the industry catches up to it within a year. The underlying economic logic is too powerful to stay underground.

---

## Part 4: Inside the autonomous dev cluster

Let me show you what's running on my desk right now.

I have eight Claude Code workers running in parallel on a single workstation. Each worker is an instance of `claude -p`, Claude Code's headless mode, wrapped in a bash script that monitors its output, kills it if it stalls, and forces a commit at the end.

Tickets enter the system through four trigger sources:

1. **Slack tag.** Anyone on my team can `@bug bot` in our internal channel. The mention payload becomes the prompt.
2. **Email forward.** I forward customer emails to a Bug Bot inbox. The email body becomes the prompt. I dictate most of mine via Whisper Flow on my phone. Yes, I voice-message my engineering team.
3. **Grafana webhook.** Every five minutes, our log ingestion pipeline groups recent production errors by stack trace, dedupes them, and fires a webhook. The error context becomes the prompt.
4. **Performance alert rule.** When P95 chat-streaming latency exceeds three seconds, a worker spawns automatically.
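Triggers 3 and 4 boil down to two small computations: collapsing errors that share a stack trace, and comparing a P95 against a threshold. Here is a minimal Python sketch, assuming log records are dicts with a hypothetical `stack` field. The function names, the hashing scheme, and the nearest-rank percentile method are illustrative choices, not a description of the actual Grafana rules.

```python
import hashlib
from collections import defaultdict


def fingerprint(stack: str) -> str:
    """Hash a whitespace-normalized stack trace so identical traces share a key."""
    normalized = "\n".join(line.strip() for line in stack.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]


def group_errors(records: list[dict]) -> dict[str, list[dict]]:
    """Group error records by stack-trace fingerprint (assumed `stack` field)."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for record in records:
        groups[fingerprint(record["stack"])].append(record)
    return dict(groups)


def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil of 0.95 * n, 1-based
    return ordered[rank - 1]


def should_spawn_worker(latencies_s: list[float], threshold_s: float = 3.0) -> bool:
    """True when P95 latency exceeds the threshold (three seconds in the rule above)."""
    return p95(latencies_s) > threshold_s
```

Each fingerprint group would become one ticket, so 47 copies of the same error produce one prompt rather than 47.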

When a ticket lands, it runs through a doc-retrieval pre-step (Phase 0) and then five stages. **Each stage gets its own context window.** This is the most important architectural rule in the whole system.

```
[Trigger] → [Phase 0: Doc retrieval] → [Phase 1: Research]
         → [Phase 2: Implement]   → [Phase 3: Code review]
         → [Phase 4: Docs update] → [Phase 5: Push to prod]
```

**Phase 0. Doc retrieval.** Before Claude even sees the ticket, a bash script semantic-searches our `/docs/` folder for any markdown files relevant to the ticket. Those docs get prepended to the research prompt as context.
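To make the shape of Phase 0 concrete, here is a small Python sketch. It swaps real semantic search for plain token overlap, which is a deliberate simplification, and the function names are hypothetical. Treat it as an illustration of "find relevant docs, prepend them to the prompt", not as the script we run.

```python
import re
from pathlib import Path


def tokens(text: str) -> set[str]:
    """Lowercased word tokens; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def retrieve_docs(ticket: str, docs_dir: Path, top_k: int = 3) -> list[Path]:
    """Rank markdown files by token overlap with the ticket text."""
    ticket_tokens = tokens(ticket)
    scored = []
    for path in sorted(docs_dir.rglob("*.md")):
        overlap = len(ticket_tokens & tokens(path.read_text(encoding="utf-8")))
        if overlap:
            scored.append((overlap, path))
    # Highest overlap first; the path breaks ties so the order is stable.
    scored.sort(key=lambda pair: (-pair[0], str(pair[1])))
    return [path for _, path in scored[:top_k]]


def build_research_prompt(ticket: str, docs: list[Path]) -> str:
    """Prepend the retrieved docs to the ticket, ahead of the research phase."""
    sections = [f"## {p.name}\n{p.read_text(encoding='utf-8')}" for p in docs]
    return "\n\n".join(sections + [f"## Ticket\n{ticket}"])
```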

**Phase 1. Research.** Fresh context window. Claude reads relevant code, queries our log drain (Loki) for similar past errors, queries our production database for affected rows, and outputs a `plan.md` file describing the root cause and proposed fix. Sensitive fields in the database are one-way encrypted; the agent can pattern-match on metadata without ever seeing customer secrets.

**Phase 2. Implement.** Brand new context window. Claude is handed the `plan.md` (no other context) and writes the code. Runs the unit tests. Spins up a local dev server in a git worktree on a dedicated port. Opens Chromium via the browser MCP. Navigates to the affected page. Takes a screenshot. Reads the dev server logs and the browser console logs. If anything's wrong, it fixes itself.

**Phase 3. Code review.** Another fresh context window. This one gets the full diff and is told to red-team the implementation. It can edit. If it makes more than a handful of edits, the harness flags the original implementation as `messy`.

**Phase 4. Docs update.** Yet another fresh context. Given the diff, this agent writes new documentation into the `/docs/` folder for any non-obvious behavior the change introduces, and prunes any docs the change has invalidated. This is how the system gets smarter over time. Tomorrow's tickets benefit from today's lessons.

**Phase 5. Push.** Bash-script-enforced. Commits the changes, pushes to the deployment branch, monitors that the deploy succeeds. Our deploy time is under 120 seconds. Production is updated before a human would have finished reading the ticket.
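The handoff rule is easier to see in code. In the Python sketch below, a "fresh context window" is modeled as one call to an injected `run_agent` function, and each phase receives only the previous phase's artifact. `run_agent`, `PhaseResult`, and the prompt wording are all hypothetical. In a real harness that call would presumably wrap a headless `claude -p` invocation, and the final push would stay in the harness rather than in an agent.

```python
from dataclasses import dataclass
from typing import Callable

# One call == one fresh context window. Nothing carries over between calls
# except what is written into the next prompt.
RunAgent = Callable[[str], str]


@dataclass
class PhaseResult:
    name: str
    prompt: str
    output: str


def run_pipeline(ticket: str, docs: str, run_agent: RunAgent) -> list[PhaseResult]:
    """Run the four agent phases; each sees only the prior phase's output."""
    results: list[PhaseResult] = []

    def phase(name: str, prompt: str) -> str:
        output = run_agent(prompt)
        results.append(PhaseResult(name, prompt, output))
        return output

    # Research sees the retrieved docs and the ticket, and produces the plan.
    plan = phase("research", f"{docs}\n\nTicket:\n{ticket}\n\nWrite plan.md.")
    # Implement sees the plan only, not the ticket or the research transcript.
    diff = phase("implement", f"Implement this plan and verify it:\n{plan}")
    # Review sees the diff only.
    reviewed = phase("review", f"Red-team this diff and fix what you find:\n{diff}")
    # Docs update sees the reviewed diff only.
    phase("docs", f"Update /docs/ for this diff:\n{reviewed}")
    return results
```

Because every prompt is built from a single artifact, a bloated research session cannot leak into the implementation window.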

There is no pull request. There is no human review. There is no staging environment that a human inspects. This is on purpose.

> "If you halfway these systems, you're going to turn some poor senior engineer's entire job into reviewing AI-generated PRs. You have to either YOLO the full way or just not at all."
> — Hunter Hodnett, Chipp CTPO

---

## Part 5: The five pillars

Inside the cluster, five disciplines do the heavy lifting. Each of them has a dedicated post on this blog. The short version is here.

### 1. Context engineering

The model has a finite sheet of paper. Everything you put on it (the system prompt, the tool definitions, the file contents you read, the search results, the past conversation) eats space. Run out of space and the system starts compacting: a small model summarizes the context window into a paragraph and replaces the original. Compaction is amnesia. Compaction is where good agents go to hallucinate.

The discipline is engineering what goes into the context window so the agent can finish its job before it runs out of paper. → *[Context Engineering: The Skill That Turns Claude Into a Production Co-Developer](/blog/context-engineering)*
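A toy version of that budgeting decision, in Python. Everything here is an assumption made for illustration: the four-characters-per-token estimate is a rough heuristic rather than a real tokenizer, and the `priority` field and greedy packing are just one possible policy.

```python
from dataclasses import dataclass


@dataclass
class ContextItem:
    name: str
    text: str
    priority: int  # lower number = more important (hypothetical scheme)


def estimate_tokens(text: str) -> int:
    """Very rough heuristic: about four characters per token."""
    return max(1, len(text) // 4)


def pack_context(items: list[ContextItem], budget_tokens: int) -> tuple[list[str], int]:
    """Greedily keep the most important items that still fit in the budget."""
    kept: list[str] = []
    used = 0
    for item in sorted(items, key=lambda i: i.priority):
        cost = estimate_tokens(item.text)
        if used + cost <= budget_tokens:
            kept.append(item.name)
            used += cost
    return kept, used
```

The useful habit is deciding up front what gets dropped, instead of letting compaction decide for you mid-task.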

### 2. CLAUDE.md as scar tissue

Your `CLAUDE.md` is the system prompt that loads in every session. Treat it as a scar tissue document, not an aspiration document. Every line in it is there because you got bitten by a real bug.

When you see the agent make the same mistake three times, stop the session, tell Claude to add a rule to `CLAUDE.md`, and continue. After six months, your `CLAUDE.md` is a textbook of your codebase's hidden rules, and your agents stop making those mistakes forever.

Use the hub-and-spoke model: a root `CLAUDE.md` plus subdirectory `CLAUDE.md`s that auto-load when files in those directories are read. → *[CLAUDE.md Architecture: A Hub-and-Spoke Pattern for Autonomous Codebases](/blog/claude-md-architecture)*
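The lookup order behind hub-and-spoke can be sketched in a few lines of Python. This illustrates the idea only (root file first, then each subdirectory on the way down to the file being read). The function name is hypothetical, and it is not a description of how Claude Code resolves these files internally.

```python
from pathlib import Path


def claude_md_chain(repo_root: Path, file_path: Path) -> list[Path]:
    """Return CLAUDE.md files from the repo root down to the file's directory."""
    root = repo_root.resolve()
    directory = file_path.resolve().parent
    relative = directory.relative_to(root)  # raises ValueError if outside the repo
    chain: list[Path] = []
    current = root
    # Check the root itself, then walk one directory level at a time.
    for part in ("",) + relative.parts:
        if part:
            current = current / part
        candidate = current / "CLAUDE.md"
        if candidate.is_file():
            chain.append(candidate)
    return chain
```

For a file under `src/billing/`, the chain would be the hub at the root followed by the billing spoke, so general rules load before specific ones.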

### 3. Skills and sub-agents

A **skill** is a chunk of text that loads into the current context window when it's relevant. Like a cheat sheet you carry with you while you work. Useful for things you need *while* coding: design system rules, brand voice, build conventions.

A **sub-agent** is an entirely separate context window the main agent can spawn. Like sending someone to the library and waiting for them to come back with a one-paragraph summary. Useful for parallelizable work or for investigations that would otherwise pollute your main context.

The decision rule: *does the agent need this knowledge while working, or can it get an answer and come back?* → *[Skills vs Sub-Agents: When to Use Each in Claude Code](/blog/skills-vs-sub-agents)*

### 4. MCP: Claude's senses

The Model Context Protocol is how the agent perceives the outside world. Without MCP, the agent is reasoning over its training data. With MCP, the agent can read your production logs, query your database, navigate your dev server, take screenshots of your UI.

The single highest-leverage MCP for autonomous development is the **browser MCP**. It's what lets the agent verify its own work. Build a custom one tied to your application: bake your dev login flow, your seed data, and your test scenarios into the tools. The off-the-shelf browser MCPs are great to learn with; they're not what you'll run in production. → *[MCP Is the USB-C of AI: Building Your First MCP Server in 30 Minutes](/blog/building-your-first-mcp-server)*

### 5. The bash harness

> "An autonomous agent without a bash harness is an intern with no manager, no deadline, and an unlimited API budget."
> — Hunter Hodnett, Chipp CTPO

Claude is non-deterministic. Your business needs to be deterministic. The bash harness is what reconciles those two.

Concretely, our harness:

- Spawns Claude with `claude -p --dangerously-skip-permissions`.
- Watches the output stream and kills the session if no tool call fires for five minutes.
- Greps for banned commands and aborts if it sees them: `git push --no-verify`, `git reset --hard`, `rm -rf`.
- Forces a final `git commit` and `git push` if Claude forgets.
- Cleans up orphaned dev server processes after each run.
- Writes outcome labels (`clean` / `messy`) for each ticket so the data is ready for fine-tuning later.

→ *[Building a Self-Healing Bug Bot: The Autonomous Dev System We Use at Chipp](/blog/self-healing-bug-bot)*
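The decision logic inside those harness rules is small enough to sketch. The Python below covers three of them as pure functions: the banned-command check, the five-minute stall test, and the `clean` / `messy` label. The regexes, the function names, and the default `messy_threshold` are assumptions for illustration. The real harness is a bash script, and the process management around these checks is left out.

```python
import re

# Patterns for the three banned commands named above. The exact regexes are
# an assumption; they tolerate extra arguments between the command and flag.
BANNED_PATTERNS = {
    "git push --no-verify": re.compile(r"\bgit\s+push\b.*--no-verify"),
    "git reset --hard": re.compile(r"\bgit\s+reset\b.*--hard"),
    "rm -rf": re.compile(r"\brm\s+-(?:rf|fr)\b"),
}


def find_banned(command: str) -> list[str]:
    """Return the names of any banned commands found in a shell command line."""
    return [name for name, pattern in BANNED_PATTERNS.items() if pattern.search(command)]


def is_stalled(last_tool_call_at: float, now: float, timeout_s: float = 300.0) -> bool:
    """True when no tool call has fired for longer than the timeout (five minutes)."""
    return now - last_tool_call_at > timeout_s


def label_outcome(review_edit_count: int, messy_threshold: int = 5) -> str:
    """Label a ticket from how many edits the review phase had to make.

    The threshold stands in for "more than a handful"; 5 is a placeholder.
    """
    return "messy" if review_edit_count > messy_threshold else "clean"
```

Keeping these checks outside the model is the whole idea: the agent can be creative, but the abort conditions are plain, testable code.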

---

## Part 6: A real ticket, end to end

Enough abstraction. Here's a real production ticket from last quarter.

**The error.** An HTTP 500 on `/api/billing` for customers without a default payment method. Stripe's `create_customer` call failed; we weren't catching the failure; the user got a blank page.

**The trigger.** Loki ingested 47 instances within five minutes. Grafana's alert rule fired. Webhook hit Bug Bot.

**The pipeline, twenty tool calls, in order:**

1. Read `CLAUDE.md` from the root.
2. Read `docs/billing.md` (auto-loaded by the doc retrieval phase because the ticket mentioned billing).
3. Pull latest staging branch.
4. Read `src/api/routes/billing/index.ts`, the file in the stack trace.
5. `bash: rg "create_customer"`, find every call site.
6. Read `src/services/stripe.ts`, the service wrapper.
7. Query Loki for similar errors over the last seven days. Result: started 90 minutes ago.
8. Query Loki for the deploy that landed 90 minutes ago. A change to default payment method handling.
9. Query the production database for affected customer rows.
10. Form hypothesis: missing null-check on `customer.default_payment_method`.
11. Edit `src/services/stripe.ts` to add the null-check and a friendly fallback.
12. Run unit tests for that file. Pass.
13. Run the full unit test suite. Pass.
14. Spin up dev server on port 5184 (worker 4).
15. Open Chromium via browser MCP.
16. Navigate to `localhost:5184/billing`.
17. Click "Buy Credits" button.
18. Take screenshot. Read browser console logs. No errors.
19. Verify HTTP 402 response with friendly toast message.
20. Commit. Push. Deploy.
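For readers who want to see the shape of the fix from steps 10, 11, and 19, here is a Python sketch. The real change lives in a TypeScript file (`src/services/stripe.ts`), so everything below is a hypothetical reconstruction of the logic only: the `Customer` type, the function name, and the message text are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Customer:
    id: str
    default_payment_method: Optional[str]


def start_credit_purchase(customer: Customer) -> tuple[int, dict]:
    """Return (status_code, body) for a hypothetical "Buy Credits" request."""
    if customer.default_payment_method is None:
        # The guard the fix adds: answer with a 402 and a friendly message
        # instead of letting the failure surface as an unhandled 500.
        return 402, {
            "error": "payment_method_required",
            "message": "Please add a payment method to buy credits.",
        }
    return 200, {"status": "ok", "payment_method": customer.default_payment_method}
```

The browser check in step 19 is then asserting exactly this: a 402 with a readable message, not a blank page.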

**The result.** A bug that would have taken a junior engineer most of a workday (read the stack trace, find the bad commit, write the fix, test it, ship it) was closed in twelve minutes by a process. Cost: about a coffee.

This is what's actually running. Nothing in this section is theoretical. The ticket above is one of about 25 we ship to production in an average day.

> "Junior engineer's day done by a process in 12 minutes."
> — Hunter Hodnett, Chipp CTPO

### A note on anti-fragility

While I was teaching a recent cohort session, Stripe went down for three hours. Half the AI industry's billing infrastructure went silent. The other half's customers couldn't buy.

We barely noticed. Six weeks earlier, I had pointed Bug Bot at "build us a redundant usage-based billing system that runs alongside Stripe." It chewed through the work over a weekend. The system had been sitting in production, in shadow mode, ever since.

When Stripe died, I flipped a feature flag. Our billing kept running. Our customers kept buying.

I bring this up not as a brag but as a description of what changes when your engineering capacity stops being a function of headcount. Redundancies that would have been "nice-to-haves with a six-month timeline" become "weekend tickets for an idle worker." Anti-fragility stops being a Talebian aspiration and starts being a default property of how you build software.

---

## Part 7: Why we're building Alchemist

When Claude Opus 4 dropped, I started using it for live coding sessions. I was sitting at my desk one afternoon, watching lines of code scroll up my screen, code I hadn't typed, code that was fixing things I hadn't asked it to fix yet. My wife came down and asked me what I was actually doing for work anymore. I didn't have an answer.

That was the moment I realized the entire "AI coding assistant" framing was wrong. The AI wasn't an assistant. It was the engineer. I was the manager.

The system you've been reading about is what I built between that afternoon and today. Three years of mistakes compressed into six months of acceleration once the model could carry the weight.

When we showed an early version of it to founders in our Technical Agents of Change cohort, the question kept coming back:

*"Can we have it?"*

Not "can we license it." Not "can we read the code." *Can we have a starter version of this whole apparatus, configured and ready, that we can run on day one?*

So we're building it. We're calling it **Alchemist**, and we're rolling it out as the first commercial implementation of autonomous development as a service.

Here's what Alchemist gives you on day one:

- A deployed, opinionated stack. Deno on the server, Svelte on the client, Chromium for verification, Cloudflare Workers for hosting. We picked this combination after months of testing because it has the highest LLM training-data density of any modern web stack. Your agents will hallucinate less and ship more in this stack than in any other.
- A configured Claude Code cluster, harness, and verification loop. Pre-wired `CLAUDE.md`, browser MCP, log-drain MCP, and a bash harness that will not let your tokens run away.
- A `/docs/` folder that builds itself as you ship.
- A self-healing pipeline. Errors in production trigger autonomous fixes.
- An integrated customer support agent that escalates real frustration to a human (us, during the alpha).

And the most important part: **the eject mechanism.**

When you create a project on Alchemist, we create a GitHub repo for you. Every commit Alchemist's workers make goes there. If you decide one day to take your code to a private cloud, save on hosting, or just leave, you take the repo and go. Nothing is locked behind our infrastructure. The Alchemist Deployment Workflow runs on your repo, and you can fork it. We learned a long time ago that the only way to build a sticky business is to make sure your customers are *choosing* to stay every month, not trapped into staying.

The closed alpha is starting now. If you want in, the waitlist is at the top of this page.

---

## Part 8: Who this is for (and who it isn't)

This is not for everyone, and I'd rather lose your attention now than later.

**This is for technical builders ready to lead a fleet of agents.** People who have shipped software, who can read a stack trace, who know what a tool call is. People who are excited, not threatened, by what AI can do with a codebase.

**This is for builders willing to delete their PR review process.** I'm serious. If you're attached to "every change gets reviewed by a human before it ships," you're going to half-implement what we describe in this post. Half-implementations are worse than no implementation. Half-implementations turn senior engineers into AI-PR-review janitors and ship slower than the manual baseline. Either YOLO the full way or stay where you are.

**This is for teams of one to ten trying to compete with teams of fifty to five hundred.** The economics of autonomous development bend hardest in your favor when you're capital-constrained and ambition-rich. We bootstrapped Chipp's autonomous cluster because we had two engineers and a roadmap that wanted twenty.

**This is not for engineers who think AI should stay out of the way.** I respect the position. I disagree with it. Either way, this isn't the post for you.

**This is not for non-technical founders looking for a no-code shortcut.** Alchemist is not Lovable. You will write `CLAUDE.md` files. You will read stack traces. You will need to know what a context window is. Our value proposition is *do less of the typing and verification yourself*, not *never touch the code*.

**This is not for organizations running on quarterly review cycles.** The pace of frontier AI doesn't wait for your governance committee. If your decision-making cadence is slower than a model release cycle, you can't run an autonomous dev cluster. You can experiment with one, but you can't run one in production.

If you've made it this far and you're still nodding, you're our reader. Welcome.

---

## Part 9: What's coming

Over the next eight weeks, we're publishing the operating manual for autonomous development. One piece of the cluster per post:

- **Context Engineering**: managing the finite paper.
- **CLAUDE.md Architecture**: the hub-and-spoke pattern, scar tissue practice, auto-load tables.
- **Skills vs Sub-Agents**: when to use which.
- **MCP**: building the senses your agents need.
- **The Bash Harness**: turning Claude into a teammate with commit access.
- **Multi-agent Pipelines**: serial handoffs, sub-agent dilution, self-healing loops.
- **Agentic Design Patterns**: the seven we battle-tested at Chipp.
- **Vibe Coding vs Autonomous Development**: the maturity curve, demystified.
- **Beyond AI Pair Programming**: from Copilot to Coworker to Autonomous Engineer.

If you want this in your inbox, the waitlist signup at the top of this page also subscribes you to the series.

---

## The bottom line

Software development is being rewritten. Three things will happen over the next twenty-four months:

1. A few teams will adopt autonomous development and ship 30 features a day at a fraction of payroll cost.
2. Most teams will halfway it, turn their senior engineers into AI PR reviewers, and end up worse off than before they started.
3. A handful of teams will resist entirely and watch their competitors lap them.

Which group your business ends up in is being decided right now.

We're betting on group one. We're building Alchemist for the people who are.

Will your engineering team ship the next decade of your product, or will an autonomous cluster ship it for you?

If the answer is the cluster, the question is whose.

**[Join the Alchemist waitlist →](/#waitlist)**

---

*Hunter Hodnett is co-founder and CTPO of [Chipp](https://chipp.ai), the platform behind 105,000+ AI agents reaching 89M+ end users. He previously engineered at Reddit, Amazon, and The Home Depot, and trained 290+ engineers as a coding bootcamp instructor. Alchemist is Chipp's autonomous development product, currently in closed alpha. The waitlist is open at [adaas.dev](https://adaas.dev).*