Harness engineering - Anurag Dhungana

My understanding from building agent infrastructure: how models become systems you can operate, inspect, and trust.

The first useful thing I learned about agents was that the model was not the part I kept fixing.

The model was usually good enough to try the next step. It could read the error. It could explain the file. It could call the tool. It could make a plausible plan.

The failures came from everything around it.

The agent lost track of what it had already done. It trusted stale memory. It called the wrong tool because every tool looked equally available. It kept going after a failure it should have surfaced. It stopped because it had produced an answer, not because the task was actually done.

That is when the word "agent" starts to feel too vague.

Sometimes people mean a chatbot with tools. Sometimes they mean a coding assistant. Sometimes they mean a workflow with a model call inside it. The word has been stretched enough that I think it needs a cleaner frame.

The frame I keep coming back to is simple:

The model is the brain.

The harness is the body.

The model reasons. The harness lets that reasoning touch the world.

That distinction changes where you look when an agent fails. The answer is not always "use a better model" or "write a better prompt." A lot of the time the answer is: the harness did not give the model the right state, tools, memory, feedback, visibility, or stopping condition.

That is the part I keep running into in my own projects.

Shrimp is a harness around an LLM. Dispatch is a control plane for deployable agents. Switchboard is a harness around chat platforms. Metro MCP is an MCP server that taught me how much agent tools depend on schemas, permissions, transport, and client behavior.

Different names.

Same pressure.

The interesting part is usually not the model, the API call, or the first happy-path demo.

The interesting part is the harness around it.


Prompting gets you started

The first layer is prompting.

You tell the model what you want: write this email, summarize this file, fix this bug, return JSON, use this tone.

For small tasks, that works. The instruction is clear, the context fits in the prompt, and the answer comes back in one shot.

The problem starts when the task stops being one shot.

Real work has steps. The model needs to inspect files, choose tools, compare results, remember what happened earlier, and sometimes ask before doing something risky.

At that point, the prompt still matters, but it is not the system.

It is just one input.

Context helps, until it becomes the problem

The next layer is context engineering.

Instead of hoping the model guesses correctly, you give it the right material: docs, files, prior messages, tool outputs, user preferences, examples, constraints.

This is a big jump. A plain prompt with the right context usually beats a clever prompt with no context.

But then a new problem shows up.

Something has to decide what context gets loaded.

Should the agent read the README or the test file? Should it search memory or inspect the current diff? Should it trust an old note or re-check the code? Should it keep going after a tool error or stop and ask?

Those decisions do not live in the prompt.

They live in the harness.

This is one of the things I learned building Metro MCP. The useful part was not just exposing transit data as a tool. It was making the tool legible to agents: what inputs were valid, what the output meant, how OAuth worked, which transport the client expected, and what failure looked like.

Once the right context was in the room, the agent got more useful.

Not because the prompt was magical.

Because the harness found the evidence.
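One way to make that decision explicit is a small selector that picks evidence under a token budget. This is a sketch under assumptions: the `Candidate` fields, the relevance scores, and the "fresh first" rule are all hypothetical, not how any of the projects above do it.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str         # e.g. "README.md" or "current diff"
    tokens: int       # estimated size in tokens
    relevance: float  # score from whatever ranking the harness uses (assumed given)
    fresh: bool       # False for notes that have not been re-checked


def select_context(candidates: list[Candidate], budget: int) -> list[str]:
    """Greedily pick context that fits the budget: fresh evidence first, then by relevance."""
    chosen, used = [], 0
    ranked = sorted(candidates, key=lambda c: (c.fresh, c.relevance), reverse=True)
    for c in ranked:
        if used + c.tokens <= budget:   # skip anything that would overflow the budget
            chosen.append(c.name)
            used += c.tokens
    return chosen
```

The useful property is that the choice is written down somewhere you can inspect, instead of living in whatever the model happened to ask for.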

[Figure: Context management as an aquarium filter: signal stays visible, old output gets filtered out.]

The harness is the loop

Most agents have a loop somewhere inside them.

[Figure: The agent loop is small; reliability comes from the rails around it.]

Look at the task. Pick the next action. Call a tool. Read the result. Update state. Decide whether to continue.

That loop is easy to draw and annoying to build.
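The drawable part fits in a few lines. A minimal sketch, assuming a hypothetical `choose_action` callable standing in for the model and a plain dict of tools; none of these names come from Shrimp.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Action:
    tool: str   # name of the tool to call
    args: dict  # keyword arguments for that tool


@dataclass
class LoopState:
    task: str
    history: list = field(default_factory=list)  # (action, result) pairs
    done: bool = False


def run_loop(
    state: LoopState,
    choose_action: Callable[[LoopState], Optional[Action]],  # stand-in for the model
    tools: dict[str, Callable[..., str]],
    max_steps: int = 10,
) -> LoopState:
    for _ in range(max_steps):                   # hard cap so the loop cannot run forever
        action = choose_action(state)            # look at the task, pick the next action
        if action is None:                       # the policy says nothing is left to do
            state.done = True
            break
        if action.tool not in tools:             # unknown tool: record it, do not guess
            state.history.append((action, "error: unknown tool"))
            continue
        try:
            result = tools[action.tool](**action.args)  # call the tool
        except Exception as exc:                 # surface the failure as a result
            result = f"error: {exc}"
        state.history.append((action, result))   # read the result, update state
    return state
```

Everything the sketch leaves out (approval, memory, progress reporting, a real definition of done) is the annoying part.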

What tools are available? What do their inputs look like? What happens when one fails? Which actions can run automatically? Which ones need approval? Where does memory live? How does the user see progress? What counts as done?

If you do not answer those questions, the model has to improvise.

Sometimes that works. Often it produces the familiar agent failure mode: a confident loop of half-correct actions that looks busy but never actually lands the task.

Shrimp came from running into this over and over.

I kept building agents where the model was not the hard part. The hard part was everything around it: the ReAct loop, event bus, capability registry, approval gate, dashboard, memory, subagents, browser control, and app integrations.

Once that was clear, Shrimp became the obvious shape: keep the model swappable, keep the capabilities pluggable, and make the core loop visible.

The model thinks.

The harness lets it act.

Some harnesses become control planes

Dispatch is the same lesson at a larger scale.

It is not just a chat box pointed at a model. The useful part is the operating surface around the agent: create an agent, provision a Daytona sandbox, write its runtime config, start OpenCode, stream tool calls back to the UI, persist sessions and runs in Convex, connect channels, and run scheduled automations.

That is a lot of plumbing for one sentence: "talk to an agent."

But that plumbing is the point.

If the sandbox is cold, something has to wake it. If OpenCode is stale, something has to rebuild it. If a tool call streams in a weird shape, something has to normalize it. If a run fails, something has to store the error in a way a human can inspect later. If the same agent talks in the web app, Slack, Telegram, and a cron job, something has to make those surfaces share the same execution path.

This is where "agent infrastructure" stops being abstract.

A model call becomes a system when you can operate it. You can see the session. You can see the run. You can see the tool events. You can see the failure category. You can restart the sandbox. You can change the permissions. You can ask why something happened without digging through vibes.

That is what I mean by harness engineering.

Some harnesses are built for one task

Not every harness has to be a full product runtime.

Some harnesses are dynamic. They exist for one hard task, then disappear.

This is what clicked for me in the recent Claude Code writing on dynamic workflows. A workflow can be a temporary harness: a small JavaScript coordinator that spawns subagents, gives each one a narrow job, waits for results, verifies them, and loops until a stop condition is met.

That matters because some tasks need shape more than they need another paragraph of instruction.

If a flaky test fails once every 50 runs, one agent in one context window may chase its first theory too hard. A workflow can spawn independent hypothesis agents, run them in worktrees, and then ask verifier agents to attack each theory.

If a draft needs every technical claim checked, one agent can miss claims or trust its own answer. A workflow can extract claims, assign each claim to a checker, ask another agent to inspect source quality, and only then revise the draft.

If you need to rank 100 support tickets, one prompt will blur. A workflow can bucket, compare, dedupe, and merge.

The useful patterns are not complicated:

classify and route
fan out and synthesize
generate and filter
make agents compete
verify adversarially
loop until done

The point is not "more agents."

The point is that the harness can change the shape of the work.
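A rough sketch of "fan out, verify, loop until done", written in Python rather than the JavaScript coordinator the Claude Code article describes. The `worker` and `verifier` callables are hypothetical stand-ins for subagents; a real workflow would call models there.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def fan_out_and_verify(
    items: list[str],
    worker: Callable[[str], str],          # stand-in for a subagent with one narrow job
    verifier: Callable[[str, str], bool],  # stand-in for an independent checker
    max_rounds: int = 3,
) -> dict[str, str]:
    accepted: dict[str, str] = {}
    pending = list(items)
    for _ in range(max_rounds):
        if not pending:                    # stop condition: everything is verified
            break
        with ThreadPoolExecutor(max_workers=4) as pool:
            results = list(pool.map(worker, pending))  # fan out; order is preserved
        retry = []
        for item, result in zip(pending, results):
            if verifier(item, result):     # accept only what the checker signs off on
                accepted[item] = result
            else:
                retry.append(item)
        pending = retry                    # the next round covers only what failed
    return accepted
```

Items that never pass verification are simply absent from the result, so the caller can see what did not land instead of getting a confident guess.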

What the body is made of

A harness is not one magic component.

It is a pile of ordinary things that work together:

local instructions like AGENTS.md or CLAUDE.md
progress files
task templates
tool schemas
permission gates
memory stores
eval scripts
browser or runtime feedback
logs and traces
stop conditions

That list looks boring. That is why it works.

The model does not need a mystical environment. It needs a legible one.

A bad harness gives the model a giant prompt, exposes every tool, hides state in conversation history, skips verification, and returns only a final answer.

A better harness gives the model grounded context, scoped tools, durable state, feedback loops, and a visible trace.

Prompting tells the model what to do.

Context tells the model where it is.

The harness decides what the model can do, what it can see, how it checks itself, and when it must stop.

The environment has to be legible

A good harness does not just give agents tools. It makes the environment legible.

That word matters.

If the agent cannot see a fact while it is working, that fact does not exist.

A Slack thread, a Google Doc, a design decision in someone's head, a random note from last week: all of that may be true, but the agent cannot use it unless the harness exposes it in a form the agent can inspect.

This is why repo-local instructions matter. An AGENTS.md file is not an encyclopedia. It is more like a map. It tells the agent where the important things live, what conventions matter, and what to check before acting.

The deeper knowledge still lives in the repo: code, tests, docs, plans, references, quality notes, commit history, and checks the agent can actually run.

That matches what I have been learning in my own systems.

Metro MCP works better when tool schemas, auth behavior, and transport contracts are explicit. Shrimp works better when tool calls, events, and approvals are visible instead of hidden inside a final answer. Dispatch works because agents are not only conversations; they are sandboxes, sessions, runs, messages, memory files, channel bindings, automations, and failure states.

The harness turns hidden work into something the agent can read.

[Figure: Hidden work becomes a system of record the agent can read and update.]

Tools need contracts

It is tempting to think of tools as simple functions.

[Figure: Tools need contracts and permission levels: auto, ask, and block.]

The model calls read_file, search_web, send_email, run_tests, whatever. Function in, result out.

That is too loose.

For an agent, a tool needs a contract. The model needs to know what the tool does, what inputs are valid, what the output means, and what failure looks like.

More importantly, the harness needs to know what kind of action this is.

Reading a file and sending an email are not the same category of thing.

Searching docs and deleting data are not the same category of thing.

So the harness needs permission levels. Some tools can just run. Some should run but notify the user. Some should ask first. Some should not exist in that environment at all.
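A minimal sketch of a contract plus a permission gate. The names (`ToolContract`, `Permission`, `gate`) and the simple "required key and type" schema are assumptions for illustration, not an API from any of the projects here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Permission(Enum):
    AUTO = "auto"      # run without asking
    NOTIFY = "notify"  # run, then tell the user
    ASK = "ask"        # wait for approval first
    BLOCK = "block"    # not available in this environment


@dataclass
class ToolContract:
    name: str
    description: str   # what the tool does
    input_schema: dict # required argument names mapped to their expected types
    permission: Permission


def gate(contract: ToolContract, args: dict, approve: Callable[[str], bool]) -> str:
    """Return 'run', 'run_and_notify', or 'denied' for one proposed call."""
    for key, expected in contract.input_schema.items():
        if not isinstance(args.get(key), expected):
            return "denied"                # invalid input never reaches the tool
    if contract.permission is Permission.BLOCK:
        return "denied"
    if contract.permission is Permission.ASK and not approve(contract.name):
        return "denied"
    if contract.permission is Permission.NOTIFY:
        return "run_and_notify"
    return "run"
```

The decision happens in the harness, before the call, so the model never has to guess whether `send_email` is as safe as `read_file`.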

Dispatch made this feel less theoretical.

An agent can have conservative, balanced, or permissive tool permissions. It can have Composio toolkits scoped to that agent. It can have Slack or Telegram bindings. It can have a runtime config generated from its model, persona, tools, memory, and channel context.

Those are not prompt details.

They are contracts.

The model should not have to guess which tools exist, which ones are safe, or what a tool result means. The product layer should know that before the model starts acting.

Coop still matters here as a smaller version of the same idea: if an agent definition has no schema, no validator, no warnings for lossy export, and no stable source of truth, you do not have an agent definition. You have a screenshot of some settings somewhere.

The harness makes the contract visible.

Planning and execution are different jobs

One of the easiest ways to make an agent worse is to ask it to plan, build, judge, and summarize in one continuous blur.

It will do it.

That does not mean the shape is good.

Planning and execution benefit from separation. Before the model starts changing things, the harness can force a plan into a concrete shape:

real file paths
real symbols
existing patterns
acceptance criteria
verification steps
known risks

That does not have to be a heavyweight spec. It can be a small impact map.
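One possible shape for that impact map, with a check that refuses plans pointing at files that do not exist. `ImpactMap` and `plan_problems` are hypothetical names, and checking symbols or existing patterns is left out to keep the sketch short.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ImpactMap:
    files: list[str]                # real file paths the change will touch
    acceptance_criteria: list[str]
    verification_steps: list[str]   # commands or checks to run afterwards
    known_risks: list[str] = field(default_factory=list)


def plan_problems(plan: ImpactMap, repo_root: Path) -> list[str]:
    """Return reasons the plan is not grounded; an empty list means it may proceed."""
    problems = []
    for rel in plan.files:
        if not (repo_root / rel).is_file():   # the plan names a file that is not there
            problems.append(f"file does not exist: {rel}")
    if not plan.acceptance_criteria:
        problems.append("no acceptance criteria")
    if not plan.verification_steps:
        problems.append("no verification steps")
    return problems
```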

The point is to stop the agent from hallucinating the environment before it acts inside it.

This also explains why planner/generator/evaluator setups work better than they sound. The planner turns the request into a grounded task. The generator implements one slice. The evaluator checks the result like a skeptical user.

The same model may play each role.

The harness keeps the roles from collapsing into each other.

Adapters keep the mess out of your head

Switchboard taught me the same lesson in a less agent-shaped place.

Discord, Slack, and Telegram all have messages. They all have replies. They all have events. They all have annoying little differences that leak into your code if you let them.

Switchboard is a harness around those platforms. One Bot API on top, adapters underneath.

Your handler should not care whether the message came from Discord or Slack. It should care about the user, the message, the reply, and the conversation.

The rest belongs in the adapter layer.
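A small sketch of that boundary. This is not Switchboard's API; the `Message` fields and the raw payload shapes are simplified assumptions, only meant to show the handler staying platform-blind.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Message:
    platform: str
    user_id: str
    text: str
    conversation_id: str


class Adapter(Protocol):
    def to_message(self, raw: dict) -> Message: ...


class SlackAdapter:
    # Assumed payload shape, for illustration only.
    def to_message(self, raw: dict) -> Message:
        return Message("slack", raw["user"], raw["text"], raw["channel"])


class DiscordAdapter:
    # Assumed payload shape, for illustration only.
    def to_message(self, raw: dict) -> Message:
        return Message("discord", raw["author"]["id"], raw["content"], raw["channel_id"])


def handle(message: Message) -> str:
    # The handler sees only the normalized shape, never the platform payload.
    return f"hi {message.user_id}, you said: {message.text}"
```

Adding a third platform means writing one more adapter. The handler does not change.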

This is the part I keep accidentally caring about.

Where does the messy outside world end?

Where does the clean interface begin?

That boundary is usually the product.

The pattern underneath

Shrimp, Dispatch, Switchboard, and Metro MCP are not the same project.

But they rhyme.

Each one sits around a messy capability and gives it structure.

Shrimp sits around an LLM and gives it tools, memory, approvals, and visibility.

Dispatch sits around deployable agents and gives them sandboxes, sessions, state, channels, automations, and an operator-facing control plane.

Switchboard sits around chat platforms and gives them one API.

Metro MCP sits around public transit systems and gives agents a cleaner way to ask for live context through MCP tools.

The repeated lesson is simple:

The hard part is rarely making something happen once.

The hard part is making the shape around it stable enough that you can keep using it.

That means interfaces.

It means normalization.

It means logs.

It means explicit failure states.

It means approvals.

It means boring names for things that would otherwise stay fuzzy.

It means knowing what is inside the core and what is an adapter.

That last one matters a lot.

If everything is core, the system becomes fragile.

If everything is a plugin, the system has no opinion.

Harness engineering is deciding where the boundary goes.

Memory is not a bigger prompt

Memory gets hand-waved a lot.

People talk about it like you can stuff more text into the context window and call it a day. That works for demos. It breaks fast in real use.

An agent has different kinds of memory.

There is the short-term state of the current task. There are long-term facts about the user. There is project memory: how this repo works, what commands are safe, what past decisions matter. There is procedural memory: the way the agent learned to do a kind of work.

The harness decides where those memories live and when they get pulled back in.

It also has to treat memory with suspicion. Old memory can be stale. Summaries can flatten the important detail. A preference can be true in one project and wrong in another.

If the agent cannot tell where a memory came from or when it was last checked, it will eventually make a confident mistake.

Good memory needs provenance.

Bad memory is worse than no memory.
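Provenance can be as plain as three extra fields. A sketch, assuming a hypothetical `MemoryRecord` with a source, a scope, and a last-verified time; the freshness rule is an arbitrary choice for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class MemoryRecord:
    content: str
    source: str            # where the memory came from, e.g. a file path or a user message
    scope: str             # which project or user it applies to
    verified_at: datetime  # when it was last checked against reality


def usable(record: MemoryRecord, scope: str, now: datetime, max_age: timedelta) -> bool:
    """A memory is usable only inside its own scope and only while it is fresh."""
    return record.scope == scope and (now - record.verified_at) <= max_age
```

A record that fails the check is not deleted. It is a prompt to go re-check the code before trusting the note.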

Dispatch forced this distinction for me. OpenCode and the Daytona volume are good places for runtime memory: session history, workspace files, config, and the agent's own notes. Convex is better for product memory: agent records, runs, sessions, messages, channel bindings, automation status, failures, credits, and things the UI needs to inspect.

Those are different jobs.

If you collapse them into one blob, the product cannot operate the agent and the agent cannot reason cleanly about its own work.

Metro MCP taught me a version of this through tools rather than notes. A tool result is memory for the next turn, even if it only lives in the context window for a few minutes. If the output has no provenance, no timestamp, no station identity, or no clue about whether the data came from WMATA or MTA, the agent will eventually sound more certain than it should.

If there are two sources of truth, there is no source of truth.

So the boring details matter: typed inputs, normalized station names, explicit provider boundaries, and errors that tell the model what failed instead of forcing it to guess.
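A sketch of what those boring details might look like in a tool result. This is not Metro MCP's actual schema; `ToolResult`, `STATION_ALIASES`, and `next_arrivals` are hypothetical, and the payload is a placeholder rather than a real provider call.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ToolResult:
    ok: bool
    provider: str                # explicit provider boundary, e.g. "WMATA" or "MTA"
    station: Optional[str]       # normalized station name, None if it could not be resolved
    fetched_at: datetime         # timestamp so the agent can judge staleness
    data: Optional[dict] = None
    error: Optional[str] = None  # says what failed, so the model does not have to guess


# Hypothetical alias table; a real one would come from the provider's station list.
STATION_ALIASES = {"metro center": "Metro Center", "metro ctr": "Metro Center"}


def next_arrivals(provider: str, station_input: str, now: datetime) -> ToolResult:
    station = STATION_ALIASES.get(station_input.strip().lower())
    if station is None:
        return ToolResult(False, provider, None, now,
                          error=f"unknown station: {station_input!r}")
    # Placeholder payload; a real tool would call the provider here.
    return ToolResult(True, provider, station, now, data={"arrivals": []})
```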

Observability is not optional

If an agent does ten steps and only shows you the final answer, you do not know what happened.

Maybe it read the right file. Maybe it skipped the only file that mattered. Maybe a tool failed and it guessed. Maybe it got lucky.

For toy tasks, that might be fine.

For real work, it is not.

A good harness shows the path: tool calls, outputs, errors, approvals, retries, and changes in direction.

Not because logs are pretty.

Because inspection is how trust gets built.

You cannot trust an agent you cannot audit.
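The path does not need a fancy store to be useful. A minimal sketch of a trace recorder, assuming a hypothetical `Trace` class and a free-form `kind` label per event:

```python
import json
import time
from dataclasses import dataclass, field


@dataclass
class Trace:
    events: list = field(default_factory=list)

    def record(self, kind: str, **detail) -> None:
        # kind is a label such as: tool_call, tool_result, error, approval, retry
        self.events.append({"ts": time.time(), "kind": kind, **detail})

    def errors(self) -> list:
        return [e for e in self.events if e["kind"] == "error"]

    def to_jsonl(self) -> str:
        # One JSON object per line, so the trace can be stored and inspected later.
        return "\n".join(json.dumps(e) for e in self.events)
```

With this in place, "maybe a tool failed and it guessed" becomes a question you can answer by calling `errors()`.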

This is another place where the recent harness writing was useful. A serious coding agent is not just a chat box pointed at a repo. It can observe the app, read logs, run checks, inspect screenshots, make a change, and loop.

That is different from pasting errors into a chat.

It means the agent can observe the system directly.

That is harness engineering.

Not a better prompt. A better feedback loop.

Addy's "Orchestration Tax" article describes the human side of the same problem. Starting more agents is cheap. Closing the loop on their work is not. If every agent output routes straight into your brain, you become the single-threaded bottleneck.

A useful harness does not remove judgement. It protects it.

It makes the agent prove the boring parts first: tests, screenshots, traces, structured diffs, reproducible steps. Then your attention goes to the part that actually needs a human.

[Figure: Without a harness, more agents dump work onto the human bottleneck. With a harness, routine evidence is filtered first.]

Verification needs a separate path

The most dangerous version of an agent is one that can act but cannot check.

[Figure: Separate the doer from the judge: one agent builds, another verifies with tests and screenshots.]

It will produce work.

It will also produce explanations for why the work is fine.

That is not enough.

A harness needs feedback paths the model does not get to hand-wave away:

tests
type checks
linters
screenshots
runtime logs
comparison agents
human approval gates
explicit acceptance criteria

Some feedback is computational. A test passes or it fails. A type checker returns errors or it does not.

Some feedback is inferential. A reviewer agent checks whether the UI matches the request. A human decides whether the tone works. A browser screenshot reveals that the button technically exists but looks wrong.

You need both.

This is where separating the doer from the judge matters. A generator is usually generous about its own output. A standalone evaluator can be skeptical.

The harness should make skepticism cheap.
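A sketch of a verification path that sits outside the generator. Each check is a hypothetical callable: a computational one might wrap a test runner, an inferential one might wrap a reviewer agent. Neither is implemented here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str = ""


def verify(checks: dict[str, Callable[[], CheckResult]]) -> tuple[bool, list[CheckResult]]:
    """Run every check; the work is accepted only if all of them pass."""
    results = []
    for name, check in checks.items():
        try:
            results.append(check())
        except Exception as exc:  # a crashing check counts as a failure, not a pass
            results.append(CheckResult(name, False, f"check crashed: {exc}"))
    return all(r.passed for r in results), results
```

Every check runs even after one fails, so the judge reports the full picture instead of the first problem.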

One thing at a time is a system feature

Agents get worse when the task gets too wide.

[Figure: One thing at a time: progress, build, verify, commit.]

They run out of context. They silently drop requirements. They make one good change and three unnecessary ones. They forget to verify the thing the user actually asked for.

The fix is not only "be more careful."

The fix is often a harness rule:

Read progress. Pick one failing feature. Implement it. Verify it. Commit it. Update progress. Repeat.

That sounds basic, but it changes the work. The agent no longer has to keep the entire project in its head. The progress file holds the queue. The repo holds the state. The tests hold the acceptance criteria. The commit boundary gives each slice a place to land.
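The rule can be sketched as a loop over a progress file. The JSON layout and the `implement`, `verify`, and `commit` callables are assumptions; a real harness would call the agent, the test suite, and git in their place.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Optional


def next_feature(progress_path: Path) -> Optional[str]:
    """Return the first feature not yet marked done, or None when the queue is empty."""
    progress = json.loads(progress_path.read_text())
    for feature in progress["features"]:
        if not feature["done"]:
            return feature["name"]
    return None


def mark_done(progress_path: Path, name: str) -> None:
    progress = json.loads(progress_path.read_text())
    for feature in progress["features"]:
        if feature["name"] == name:
            feature["done"] = True
    progress_path.write_text(json.dumps(progress, indent=2))


def work_queue(progress_path: Path, implement: Callable[[str], None],
               verify: Callable[[str], bool], commit: Callable[[str], None]) -> list[str]:
    landed = []
    while (name := next_feature(progress_path)) is not None:  # read progress, pick one
        implement(name)
        if not verify(name):   # stop and surface the failure instead of moving on
            break
        commit(name)
        mark_done(progress_path, name)                         # update progress
        landed.append(name)
    return landed


# Usage with stand-in callables: "login" verifies, "logout" does not.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "progress.json"
    path.write_text(json.dumps({"features": [
        {"name": "login", "done": False},
        {"name": "logout", "done": False},
    ]}))
    landed = work_queue(path, implement=lambda n: None,
                        verify=lambda n: n == "login", commit=lambda n: None)
    remaining = next_feature(path)
```

The queue lives in the file, not in the context window, so a fresh agent can pick up exactly where the last one stopped.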

One thing at a time is not a motivational slogan.

It is infrastructure.

Stopping is part of the system

One of the least glamorous parts of agents is stopping.

When is the task done? When should the agent retry? When should it ask for help? When should it give up?

If the harness does not define this, the model guesses.

That is how you get agents that keep searching after they already found the answer, keep editing after the fix is done, or declare success because they produced a plausible final message.

The harness can make the boundary explicit.

Maybe tests have to pass. Maybe a screenshot has to be generated. Maybe a destructive action needs approval. Maybe the agent must produce a specific artifact before it can say it is finished.

The model still reasons.

The harness decides what "done" means.
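A minimal sketch of an explicit stop policy. The fields on `RunStatus` and the thresholds are hypothetical; the point is only that the verdict comes from evidence the harness collected, not from the model's final message.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    CONTINUE = "continue"
    DONE = "done"
    ASK_HUMAN = "ask_human"
    GIVE_UP = "give_up"


@dataclass
class RunStatus:
    tests_passed: bool
    artifact_exists: bool
    pending_destructive_action: bool
    steps_used: int
    consecutive_failures: int


def decide(status: RunStatus, max_steps: int = 20, max_failures: int = 3) -> Verdict:
    if status.pending_destructive_action:
        return Verdict.ASK_HUMAN   # approval comes before anything else
    if status.tests_passed and status.artifact_exists:
        return Verdict.DONE        # done is defined by evidence
    if status.consecutive_failures >= max_failures:
        return Verdict.ASK_HUMAN   # retries are not working, escalate
    if status.steps_used >= max_steps:
        return Verdict.GIVE_UP     # step budget exhausted
    return Verdict.CONTINUE
```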

Harnesses need garbage collection

Agent systems produce stuff quickly.

Some of it is good. Some of it is weird. Some of it is locally fine but creates long-term drift.

If you do not build cleanup into the harness, the system slowly teaches future agents the wrong patterns.

You need recurring cleanup, mechanical checks, and a way to turn human taste into rules the agent can keep applying.

Otherwise the harness gets heavier every week.

This is also where "build to delete" matters.

[Figure: Build to delete: test harness components for quality, cost, and time, then remove what stops helping.]

Every harness component encodes a belief about what the model cannot do yet.

A planning gate says the model is not reliable enough to plan and execute in one pass. A verifier says the generator is not reliable enough to judge itself. A detailed tool description says the model needs that guidance to use the tool correctly.

Those beliefs can expire.

As models improve, some harness components stop earning their keep. A rule that prevented mistakes in March can become token overhead in June. A complex workflow can become slower than a simpler one. A tool list can grow until the model performs worse because it sees too many options.

Good harness engineering is not adding scaffolding forever.

It is adding constraints, measuring whether they help, and deleting the ones that stop paying rent.
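Measuring can start as a plain comparison of runs with and without one component. A sketch under assumptions: `RunMetrics`, the three metrics, and the `min_gain` threshold are illustrative, and a real comparison would need enough runs to be meaningful.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunMetrics:
    passed: bool
    cost_usd: float
    seconds: float


def summarize(runs: list[RunMetrics]) -> dict[str, float]:
    """Average quality, cost, and time over a batch of eval runs."""
    return {
        "pass_rate": mean(1.0 if r.passed else 0.0 for r in runs),
        "cost_usd": mean(r.cost_usd for r in runs),
        "seconds": mean(r.seconds for r in runs),
    }


def keep_component(with_it: list[RunMetrics], without_it: list[RunMetrics],
                   min_gain: float = 0.05) -> bool:
    """Keep the component only if it raises the pass rate by at least min_gain."""
    gain = summarize(with_it)["pass_rate"] - summarize(without_it)["pass_rate"]
    return gain >= min_gain
```

This version looks only at pass rate. Cost and time are summarized so the same comparison can flag a component that helps quality a little but doubles the bill.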

Same brain, different body

This is why two products using the same model can feel completely different.

One feels chaotic. Another feels reliable.

The difference is usually not the model. It is the body around the model.

The prompt matters. Context matters. The model matters too. But the agent experience is the whole thing: tools, memory, approvals, state, logs, UI, evals, and stop conditions.

That is the harness.

And once you see it, a lot of agent work starts to look different.

The question is not only:

Which model are you using?

The better question is:

What have you built around it?

Because that is where most of the product lives.

For a while, I would have called this agent engineering, dev tools, or workflow automation, depending on the project.

Those labels are still true.

But they miss the thing I keep caring about.

I care about the shape around the intelligence.

I care about the layer that turns a powerful but slippery capability into something you can operate, inspect, move, and trust.

That is why I keep building runtimes, formats, adapters, queues, ledgers, approval gates, and validators.

The model is easy to point at.

The API is easy to demo.

The first version is easy to believe in.

The harness is what survives contact with actual use.

Sources I found useful

Caleb Writes Code, "Agent Harness explained in 8min.."
OpenAI, "Harness engineering: leveraging Codex in an agent-first world"
Zhong and Zhu, "AI Harness Engineering: A Runtime Substrate for Foundation-Model Software Agents"
Thariq Shihipar and Sid Bidasaria, "A harness for every task: dynamic workflows in Claude Code"
Akshay Pachaar, "The Anatomy of an Agent Harness"
Rahul, "Harness Engineering: What Every AI Engineer Needs to Know in 2026"