# The Bash Harness: Determinism Wraps Non-Determinism

> An autonomous Claude session without a wrapper is a brilliant intern with no manager, no deadline, and an unlimited API budget. The bash harness is the manager. Here's what it enforces, why it has to be deterministic, and the specific failure modes it prevents.

There's a moment in every autonomous-coding-cluster project where the engineer realizes their problem isn't the model. The model is fine. The model would happily spend twelve hours and $400 trying every possible solution to a misconfigured environment variable. The problem is the *manager*.

Claude on its own (even Claude 4.6, even Claude with full tool use, even Claude with MCP and a perfect `CLAUDE.md`) has no concept of "this task should have been done an hour ago" or "I've made the same mistake three times in a row." It doesn't quit. It doesn't get bored. It will burn your entire token budget if you let it.

The bash harness is the part of the cluster nobody writes blog posts about, because bash isn't sexy. It's also the part that makes the whole thing reliable enough to push to production. This is that post.

## The framing

Cline put it best, in a documentation update I read after building most of our harness: *"An autonomous agent without a bash harness is a brilliant intern with no manager, no deadline, and an unlimited API budget."*

That's the threat model. The model is the intern. The bash harness is the manager.

The intern is genuinely good. The intern can do work that would take a human three days. What the intern can't do is *self-supervise*. If the intern wanders into a corner of your codebase that has no test coverage and decides to refactor it from scratch, the intern won't notice that's not what you asked for. If the intern can't figure out how to fix a bug, the intern won't say "I give up." The intern will keep trying, with progressively more elaborate workarounds, until something stops it.

The bash harness is what stops them. Specifically, it does six things.

## What the harness does

### 1. It runs Claude headlessly

The headline command is `claude -p "your prompt here"`. The `-p` flag puts Claude Code in non-interactive (headless) mode. Instead of opening a chat UI, Claude runs the prompt to completion, writes its output to standard output, and exits when done.

We pair `-p` with `--dangerously-skip-permissions`, which tells Claude to stop asking for confirmation before every tool call. In an interactive session, you want those prompts. In a headless session that needs to run unattended, you don't. (The `--dangerously-` prefix is honest. Skip permissions and the agent can do real damage. The bash harness is what makes this safe.)

The combination (headless mode plus skip permissions) is the atomic primitive of every autonomous coding system built on Claude Code. If you only learn one thing from this whole series, learn that. Without it, the model is a chatbot. With it, the model is a worker.
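A minimal sketch of that primitive as a bash function. The two flags are the ones described above; `CLAUDE_BIN`, `run_headless`, and the log path are hypothetical names chosen for illustration, not part of any real harness.

```shell
#!/usr/bin/env bash
# Sketch only: CLAUDE_BIN and run_headless are hypothetical names.
# CLAUDE_BIN lets you point the harness at a different binary (or a stub in tests).
CLAUDE_BIN="${CLAUDE_BIN:-claude}"

run_headless() {
  local prompt="$1" log_file="$2"
  # -p runs non-interactively; --dangerously-skip-permissions suppresses
  # the per-tool confirmation prompts. Output and errors go to the log file.
  "$CLAUDE_BIN" -p "$prompt" --dangerously-skip-permissions >"$log_file" 2>&1
}

# Usage (hypothetical prompt and path):
#   run_headless "Fix the failing test in src/auth" "/tmp/session-01.log"
```

Everything else in the harness is a wrapper around this one call, which is why it is worth isolating in its own function.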

### 2. It enforces a wall-clock timeout

If Claude has been running for two hours, something is wrong. Either it hit a class of problem it can't solve, or it's looping, or the task was wildly under-scoped. Either way, killing it is the right move.

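Our harness wraps `claude -p` in a `timeout 7200` call (2 hours, expressed in seconds). When the timer expires, `timeout` sends the process SIGTERM and exits with status 124. The harness then captures the work-tree for forensics and logs a failure ticket for human review. (`timeout` is part of GNU coreutils, so on macOS it has to be installed separately.)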

The number isn't sacred. We started at 8 hours, dropped to 4 when we realized we were paying for runaway sessions, and settled at 2 because every legitimate task we've measured fits well under that. Pick a number that's longer than your slowest legitimate task and shorter than "I will go bankrupt."
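Here is a rough sketch of the timeout wrapper, assuming GNU coreutils `timeout`. The names `run_with_deadline` and `MAX_SECONDS` are made up for this example, and the 30-second grace period before SIGKILL is an assumption, not a measured value.

```shell
#!/usr/bin/env bash
# Sketch only: run_with_deadline and MAX_SECONDS are hypothetical names.
MAX_SECONDS="${MAX_SECONDS:-7200}"   # 2 hours, the limit discussed in the text

run_with_deadline() {
  # -k 30: if the process ignores SIGTERM, follow up with SIGKILL 30s later.
  timeout -k 30 "$MAX_SECONDS" "$@"
  local status=$?
  case "$status" in
    124) echo "deadline: stopped after ${MAX_SECONDS}s" >&2 ;;
    137) echo "deadline: process died from SIGKILL" >&2 ;;
  esac
  # Pass the status through so the caller can decide what to log.
  return "$status"
}

# Usage: run_with_deadline claude -p "your prompt here" --dangerously-skip-permissions
```

Returning the raw exit status keeps the policy (capture the work-tree, file a ticket) in the caller rather than buried in the wrapper.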

### 3. It detects hangs

Claude can be running but not making progress. Sometimes it gets stuck on a tool call: a Docker container that takes forever to start, an MCP server that hangs, an HTTP request to a slow third party. The wall-clock timeout catches this eventually, but you don't want to pay for two hours of nothing.

The harness watches the agent's stdout for *tool call activity*. Every meaningful turn produces a tool call entry. If five minutes pass with no tool call output, the harness assumes the agent is stuck and kills it. The threshold is tunable. We started at 2 minutes and bumped to 5 because some legitimate operations (large grep on a big codebase, slow MCP servers) take longer.
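One way to sketch a hang detector is a watchdog that polls the session log and counts lines that look like tool calls. This assumes the agent's stdout is redirected to a log file and that tool calls show up as lines containing `tool_use` (plausible with a streaming JSON output format, but treat the pattern as an assumption). `watch_for_hang` and `ACTIVITY_PATTERN` are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch only: watch_for_hang and ACTIVITY_PATTERN are hypothetical names.
# Usage: watch_for_hang PID LOG_FILE [MAX_IDLE_SECONDS] [POLL_SECONDS]
watch_for_hang() {
  local pid="$1" log_file="$2" max_idle="${3:-300}" poll="${4:-5}"
  local pattern="${ACTIVITY_PATTERN:-tool_use}"   # assumed marker for a tool call
  local last_count=0 idle=0 count

  while kill -0 "$pid" 2>/dev/null; do
    sleep "$poll"
    kill -0 "$pid" 2>/dev/null || break           # session ended while we slept

    # Count matching lines so far; a missing log counts as zero activity.
    count="$(grep -c -e "$pattern" "$log_file" 2>/dev/null || true)"
    count="${count:-0}"

    if [ "$count" -gt "$last_count" ]; then
      last_count="$count"
      idle=0                                      # new tool call seen, reset the clock
    else
      idle=$((idle + poll))
    fi

    if [ "$idle" -ge "$max_idle" ]; then
      echo "hang: no tool call activity for ${idle}s, killing pid $pid" >&2
      kill "$pid" 2>/dev/null
      return 1
    fi
  done
  return 0
}
```

Run it in the background next to the session: start the agent with `&`, capture `$!`, and pass that PID in. A return of 1 means the watchdog killed the session; 0 means the session ended on its own.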

### 4. It bans dangerous patterns

The agent will, occasionally, output a command that would do irreversible damage. Not from malice. From training data. Real examples we've caught:

- `git reset --hard HEAD~5` (when the agent gives up and tries to wipe its work)
- `git push --no-verify` (when CI fails and the agent decides to bypass the check)
- `rm -rf /` (rare, but yes)
- `--force-overwrite` flags on database migrations

The harness greps the agent's command stream for known-bad patterns. When it sees one, it kills the session and logs a ticket. We curate this list as we discover new ones.

This is defense-in-depth. The model *shouldn't* output these commands; we have rules in `CLAUDE.md` against most of them. But the model occasionally will, and "shouldn't" is not the same as "won't." The harness is the airbag.
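A small sketch of the pattern check, using the four examples from the list above. The regular expressions are illustrative guesses at how to match them, and `BANNED_PATTERNS` and `is_banned` are hypothetical names. A real list would be longer and tuned against false positives.

```shell
#!/usr/bin/env bash
# Sketch only: BANNED_PATTERNS and is_banned are hypothetical names.
# Each entry is an extended regular expression matched against one command line.
BANNED_PATTERNS=(
  'git reset --hard'
  'git push .*--no-verify'
  'rm -rf /( |$)'            # bare root only; "rm -rf /tmp/build" is allowed
  '--force-overwrite'
)

# Returns 0 (true) if the command matches any banned pattern.
is_banned() {
  local cmd="$1" pattern
  for pattern in "${BANNED_PATTERNS[@]}"; do
    # -e keeps patterns that start with "--" from being parsed as grep options.
    if grep -Eq -e "$pattern" <<<"$cmd"; then
      return 0
    fi
  done
  return 1
}

# Usage:
#   if is_banned "$next_command"; then echo "banned command, killing session" >&2; fi
```

Keeping the list as a plain array means adding a newly discovered pattern is a one-line change that anyone can review.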

### 5. It cleans up after the agent

Each agent session runs in its own *git work-tree*: a separate checkout of the repo, isolated from every other agent's work. When a session ends, the work-tree might contain uncommitted changes, half-finished branches, leftover processes (dev servers, browser instances, MCP servers).

The harness runs a cleanup pass. It checks the work-tree for uncommitted changes and, depending on session outcome, either commits them with a salvage marker or stashes them for forensics. It runs `pkill` on any orphaned processes that started during the session. It deletes the work-tree if the session succeeded, or archives it if the session failed.

Without this, every failed session leaves debris. After a hundred sessions, your `~/code/worktrees` directory is a graveyard. Cleanup is unglamorous and load-bearing.
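The cleanup pass could be sketched roughly like this. It simplifies the description above: it always uses a salvage commit rather than a stash, and it archives failed sessions as a tarball before removing the work-tree. `kill_orphans`, `finalize_worktree`, and `ARCHIVE_DIR` are hypothetical names, and matching orphans by work-tree path assumes the leftover processes were started with that path on their command line.

```shell
#!/usr/bin/env bash
# Sketch only: all function and variable names here are hypothetical.
ARCHIVE_DIR="${ARCHIVE_DIR:-$HOME/worktree-archive}"

# Kill leftover processes whose command line mentions the work-tree path.
kill_orphans() {
  local worktree="$1"
  [ -n "$worktree" ] || return 2          # never call pkill with an empty pattern
  pkill -f -- "$worktree" 2>/dev/null || true
}

# Usage: finalize_worktree REPO_DIR WORKTREE_DIR success|failure
finalize_worktree() {
  local repo="$1" worktree="$2" outcome="$3"
  [ -d "$worktree" ] || return 2

  # Preserve uncommitted work under an obvious marker before removing anything.
  if [ -n "$(git -C "$worktree" status --porcelain)" ]; then
    git -C "$worktree" add -A
    git -C "$worktree" commit --quiet -m "SALVAGE: uncommitted work from $outcome session"
  fi

  # Failed sessions are archived for forensics.
  if [ "$outcome" != "success" ]; then
    mkdir -p "$ARCHIVE_DIR"
    tar -czf "$ARCHIVE_DIR/$(basename "$worktree")-$(date +%Y%m%dT%H%M%S).tar.gz" \
      -C "$(dirname "$worktree")" "$(basename "$worktree")"
  fi

  # The branch and its commits stay in the main repo; only the checkout goes away.
  git -C "$repo" worktree remove --force "$worktree"
}
```

A typical call order would be `kill_orphans "$wt"` followed by `finalize_worktree "$repo" "$wt" "$outcome"`, so nothing is still writing to the tree while it is being archived.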

### 6. It enforces the push step

When a session succeeds, the agent should push its commit to a target branch. Sometimes the agent forgets, or mis-commits, or commits without pushing. The harness checks the work-tree state at the end of every session: are there committed-but-unpushed changes? If yes, the harness pushes them. Are there changes the agent forgot to commit? If yes, log a failure. The agent didn't finish.

This is the final guardrail. The whole pipeline ends with code in production. If the agent's work doesn't end up there, the entire run was wasted. The harness *makes sure* the work lands.
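The end-of-session check can be sketched as a single function. This version assumes the agent's branch is pushed to a same-named branch on the remote and that the local remote-tracking ref is reasonably fresh; `ensure_pushed` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Sketch only: ensure_pushed is a hypothetical name.
# Usage: ensure_pushed WORKTREE_DIR [REMOTE]
# Returns non-zero if the agent left uncommitted changes or the push fails.
ensure_pushed() {
  local worktree="$1" remote="${2:-origin}" branch unpushed

  # Uncommitted changes mean the agent stopped before finishing.
  if [ -n "$(git -C "$worktree" status --porcelain)" ]; then
    echo "push check: uncommitted changes, marking session as failed" >&2
    return 1
  fi

  branch="$(git -C "$worktree" rev-parse --abbrev-ref HEAD)"

  # Count commits on HEAD that the remote-tracking branch does not have.
  if git -C "$worktree" rev-parse --verify --quiet "refs/remotes/$remote/$branch" >/dev/null; then
    unpushed="$(git -C "$worktree" rev-list --count "refs/remotes/$remote/$branch..HEAD")"
  else
    unpushed=1   # the branch has never been pushed
  fi

  if [ "$unpushed" -gt 0 ]; then
    git -C "$worktree" push --quiet "$remote" "HEAD:refs/heads/$branch"
  fi
}
```

Note that this uses a plain `git push` with no force or hook-skipping flags, so the check itself stays inside the rules the banned-pattern list enforces.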

## Why bash, specifically

I wrote the first version of the harness in TypeScript. It was elegant. It had types. It also took ten times as long to debug as the bash version that replaced it.

Bash is the right language for this. Three reasons:

1. **The harness wraps a Unix process, and bash is the language for wrapping Unix processes.** Subprocess management, signals, file descriptors, timeouts, pipes. All of this is expressed concisely in bash and verbosely in everything else.
2. **The harness has to be debuggable in production.** When something goes wrong, you SSH to the machine and read the script. Bash scripts are visible. There's no compilation step, no module resolution, no runtime to set up. You can audit the entire harness in fifteen minutes.
3. **The harness should be boring.** It's the deterministic core of an otherwise non-deterministic system. The fewer abstractions, the fewer dependencies, the better. Bash needs nothing beyond the standard Unix tools it calls (`timeout`, `grep`, `pkill`, `git`). It will work on any Linux box, with next to no setup, in five years.

The harness is currently 312 lines of bash. Most of those lines are comments. The actual logic is short.

## Each agent gets its own work-tree

A note on parallelism, because it's the part that surprises people.

We run eight agents at a time on my laptop. They don't share a working directory. They don't share a database. They don't share dev server ports. Each agent has:

- Its own git work-tree (a checkout of the repo at a separate filesystem path).
- Its own dev server, on a unique port (5180-5187, parameterized by worker slot).
- Its own browser MCP target (Chromium instance, also on a unique port).
- Its own log stream, prefixed with the worker slot for easy filtering.

The agents are completely isolated. Two agents working on conflicting changes won't step on each other. They're in different work-trees, on different branches. They all push to the same remote, and any merge conflicts are resolved by the deploy pipeline (or, in the rare case where they're not auto-resolvable, surfaced as a failure ticket).

This is the part of the architecture I was most nervous about, and it's the part that has worked best. Eight independent workers running in parallel turns out to be the right level of concurrency for one developer's machine. Going higher saturates my Mac's resources. Going lower means tickets sit in queue.
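Deriving everything from the worker slot can be sketched in a few lines. The port range 5180-5187 and the `~/code/worktrees` root come from the text above; the function names and the `worker-N` directory naming are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: function names and the worker-N naming scheme are hypothetical.
WORKER_COUNT=8
BASE_DEV_PORT=5180
WORKTREE_ROOT="${WORKTREE_ROOT:-$HOME/code/worktrees}"

# Slot 0..7 maps to dev server port 5180..5187.
dev_port_for_slot() {
  local slot="$1"
  case "$slot" in
    ''|*[!0-9]*) echo "slot must be a non-negative integer" >&2; return 1 ;;
  esac
  if [ "$slot" -ge "$WORKER_COUNT" ]; then
    echo "slot $slot out of range (0-$((WORKER_COUNT - 1)))" >&2
    return 1
  fi
  echo $((BASE_DEV_PORT + slot))
}

# Each slot gets its own checkout path under the shared root.
worktree_for_slot() {
  echo "$WORKTREE_ROOT/worker-$1"
}

# Prefix every line on stdin with the worker slot for easy filtering.
prefix_log() {
  local slot="$1" line
  while IFS= read -r line; do
    printf '[worker-%s] %s\n' "$slot" "$line"
  done
}

# Usage: some_session_command 2>&1 | prefix_log 3 >>cluster.log
```

Because every per-agent resource is a pure function of the slot number, two workers can only collide if they are handed the same slot.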

## The lesson

Every autonomous coding system you build will fail in ways the model can't recover from. The model is good at writing code. It's bad at noticing it's wasted three hours, bad at stopping itself, bad at cleaning up its mess. The bash harness is the part of the system that's good at those things.

Without it, you have a hobby project. With it, you have something you can leave running overnight.

This is the unglamorous core of every production-grade agentic system I've seen. Cursor has one (their internal harness for background agents). Devin has one. We have one. If you're building your own cluster, you'll build one too. Mine will save you a few weeks of mistakes.

In the final post of this series, I'll get into distillation, the part that turns short-term Anthropic-dependence into long-term independence, and the part the AI labs are quietly going to lobby against. If you've followed this series this far, distillation is the post that matters most for your business in 2027.

If you'd rather skip the year of building and use a cluster that already has all of this, [join the Alchemist waitlist](/#waitlist).
