Distillation Is Your Moat

I’ve spent the last six posts explaining how to build an autonomous coding cluster on top of Anthropic’s models. This is the post about why that’s a temporary arrangement.

If you’re going to build a real business on top of agentic AI, you need to understand the platform risk you’re taking. And you need to understand the move that mitigates it (distillation) because it’s both the long-term moat for AI-native businesses and, increasingly, the most contested issue in AI policy. The AI labs are about to start lobbying to make distillation harder, slower, or illegal. The reasons they’ll give will sound noble. The actual reasons are about preserving their monopolies. Pay attention to this fight.

This post covers what distillation is, why it matters, what we’ve learned doing it, and why the regulatory dimension is the part you should be loudest about.

The platform risk

Right now, in early 2026, Anthropic has the best coding model. By a meaningful margin. It’s not close.

If you’re building an autonomous coding system, you use Claude. Cursor uses Claude under the hood for most of their work. Cognition (Devin) uses Claude. We use Claude at Chipp. The companies advertising “Claude alternative” or “Gemini-powered” coding agents are quietly using Claude when their customers care about output quality.

That dominance is real and it’s a problem.

Right now, Anthropic prices their Claude Max plans at $100-$200 per month with very generous usage caps. They subsidize this: every Claude Max session costs them more in compute than they’re charging. They do this because they’re capturing the market, training on the data, and improving their models faster than anyone else. They’re playing the Uber-rides-cost-$3 phase of the platform game. We benefit.

Eventually that ends. Anthropic, or the company that buys them, has to stop subsidizing. Prices go up. Rate limits get tighter. Specific use cases (say, “running thirty parallel agents on a single machine”) get reclassified as enterprise-tier and priced accordingly. We saw the first sign of this when Claude Code quietly moved from the Pro plan to the Max plan in April. That’s not the last move.

If you’ve built your business on Claude, every move Anthropic makes is a move against your margin. Eventually they’ll capture so much of the value you create that the only way out is to cut a deal. That’s not a hypothesis. That’s how every platform plays out, every time.

The defense is distillation.

What distillation is

Distillation, in plain language: take the frontier model’s outputs and use them as training data for a smaller, cheaper model that you own.

In a normal training run, you’d train a model from scratch on raw data: the entire internet, code from GitHub, documentation. That takes hundreds of millions of dollars and 18 months. Distillation skips most of that. You start with an open-source base model: a smaller, less-capable model someone else has already pre-trained at vast expense and released for free. Then you fine-tune that model on a curated dataset of high-quality input/output pairs. Each pair is a question or task, and the answer the frontier model gave to it.

The smaller model learns to mimic the frontier model’s behavior on the specific kinds of tasks you fine-tune for. It doesn’t get smarter at everything. But it gets much smarter at the slice you trained for. For a narrow, specialized use case (say, autonomous coding in your specific tech stack), a distilled model can hit 60-90% of the frontier model’s quality, at a fraction of the inference cost, on hardware you control.

This is how DeepSeek matched OpenAI’s GPT-class models for $5 million in training compute, against the billions OpenAI spent. They distilled. The Chinese open-source community is pioneering this technique, partly because it’s an effective way to compete without matching American capital expenditure, and partly because they’ve correctly identified that the frontier-model arms race is a winner-takes-most game they can’t win head-on.

What it costs to distill

I’ve been distilling our autonomous coding cluster’s outputs for the last few months. The honest scoreboard:

We’ve spent about $2,500 on training so far, across a few experimental runs.
Our best distilled model, based on Qwen 2.5 14B, performs at about 63% of Claude Opus 4.6’s quality on our internal benchmarks.
Inference cost on the distilled model is roughly 1/100th of inference on Opus. About $0.04 per ticket vs $4 per ticket.

63% sounds bad. It’s not. Most of what an autonomous coding cluster does is not the frontier of model capability. It’s the routine work. Reading code, writing CRUD, applying conventions, running tests. A 63% Opus model that costs 1% as much will handle 80% of the tickets. Frontier models can handle the remaining 20%.

The economics of “cheap distilled model handles routine tickets, frontier model handles edge cases” are very different from “frontier model handles everything.” The former is sustainable at scale. The latter is a bet that Anthropic stays charitable.

The distillation process itself is straightforward in shape and difficult in detail. You collect training data: every prompt your frontier model has answered, every tool call sequence, every successful output. You curate it (this is the hard part; bad data poisons the training). You pick a base model and a fine-tuning approach. You run the training. You evaluate. You ship.

We use Hugging Face for the model hosting and the training infrastructure. The base models we’ve experimented with most are Qwen (Alibaba’s open-source family, currently the strongest open coding models) and Llama (Meta’s open family). We use supervised fine-tuning (SFT), the simpler of the two main approaches. The other approach, preference-based methods like DPO and GRPO, requires far more data and is what DeepSeek used to match GPT. SFT is the entry point. The amateur approach. It works.

The training data is the asset. We’ve recorded every successful autonomous run on our cluster: every prompt, every tool call, every output, every outcome label (did this fix actually solve the issue?). After a few months, we have a dataset that’s specific to our use case in a way no public dataset is. That’s the moat.

Save your training data now

Here’s the operational lesson: start saving your training data today, even if you’re not training yet.

If you’re using Claude Code interactively, your chat history is on your machine. Move it to a stable location. Back it up. Tag the conversations by outcome: did the work ship? Did it break? Did you have to re-prompt? Outcome labels are what makes training data valuable, and outcomes are easy to capture in the moment and impossible to reconstruct after the fact.

If you’re building a product on Anthropic’s API, log every request and response. Tag the requests by feature area, by use case, by outcome. Store the logs in a database, not a logfile. You will want to query them.

If you’re operating a SaaS that uses LLMs, your customers’ interactions with your AI features are training data. Make sure your terms of service give you the right to use that data for model improvement, and make sure the data is being captured in a usable format.

Three years from now, when distillation is the obvious move, you’ll either have years of high-quality training data or you’ll have to start from scratch. The decision to capture the data is one you make now. The decision to train on it is one you can defer.

Why this fight is coming

The major AI labs do not want you to distill. Distillation breaks their business model. If everyone can take the outputs of the frontier model and train cheaper, comparable models for narrow use cases, the labs lose pricing power.

Their playbook, which is starting to surface in policy discussions, has three moves.

Move 1: redefine distillation as theft. Rename it “model output exfiltration” or “intellectual property circumvention.” Argue that training one model on another model’s outputs is a copyright violation. (The legal argument here is weak, since model outputs are not copyrightable in the same way human writing is, but the labs have a lot of lobbyists.)

Move 2: lobby for export controls. Argue that distilled models are a national security risk because foreign adversaries can use the technique to catch up with American AI capability. Get a regulation passed that requires labs to add clauses to their terms of service prohibiting distillation, with criminal penalties for violation. (DeepSeek will be invoked. The implication will be that all distillation is geopolitically dangerous. The actual concern is monopoly maintenance.)

Move 3: technical countermeasures. Watermark model outputs in ways that distilled models will inherit, then sue anyone whose model produces watermarked-output-style behavior. (Technically hard. The labs are trying anyway.)

If you care about a competitive AI ecosystem (and you should, because monopolies are what drive prices up and innovation down) distillation is the single most important policy issue to follow. Pay attention to who’s introducing legislation. Pay attention to who’s funding it. The “AI safety” framing will be heavy in the air. Most of it will be downstream of monopoly preservation.

The longer arc

Where I think this all heads, in five years:

Open-source models will be 80-90% as capable as frontier models on narrow tasks, at 1% of the inference cost. Specialized industries will run their own fine-tuned models on dedicated hardware. There are companies right now taking AI models and embedding them on hardware chips, and that’s a real direction. The frontier labs will sell the very-hardest-task tier as a premium service, but the routine work will not be on their infrastructure.

That’s the future where AI-native businesses are durable. The opposite future, where frontier labs own all inference, all data, and all margin, is the one the labs are working toward.

Distillation is the lever. Yours, mine, the open-source community’s. Pay attention to this. Fight when it’s time to fight.

Wrapping the series

This is the last post of the five-part engineering series the manifesto promised. The arc:

The Autonomous Development Manifesto: the printing press analogy and why this matters now.
Context Engineering: the discipline that determines whether your agents work.
Building a Self-Healing Bug Bot: the architecture of a real autonomous cluster.
The Bash Harness: the manager that supervises the brilliant intern.
Distillation Is Your Moat: the long game.

If you’ve read all seven, you understand the technology and the strategy of autonomous coding as well as anyone working in the field today. There is no secret you’re missing. The work is in the doing.

If you want a head start (a cluster already tuned, an autonomous engineering team you can describe a SaaS to and watch ship it, a stack you can eject from any time you want) join the Alchemist waitlist. We’ve spent a year and six figures of compute building this. We’re packaging it for you.

Either way: stop being a spectator. The next three years are the formative ones for AI-native software, and the people building right now are the ones who will be telling the story.

Get in the arena.