The model is easy. The body is hard.
I used to think most of the pain in agent systems would come from prompts.
Not all of it, obviously. But enough of it that if I got the prompting right, the rest would feel manageable.
Building Shrimp cured me of that pretty quickly.
Shrimp is an open-source agent harness I started pulling together after I got tired of rewriting the same glue around different models. I did not set out to make some grand statement about agents. I mostly wanted a runtime that could survive contact with real I/O: tool calls, approvals, sessions, memory, browser control, and all the awkward wiring that starts to matter the second a model has to do anything outside a text box.
That changed how I think about where the hard part actually lives.
The model is the easy part.
Or at least, it is easier than the body around it.
The body is everything that has to hold together once the model leaves the sandbox. Tool execution. Session state. Approval flows. Error handling. Eventing. Memory. Cost tracking. The unglamorous stuff that determines whether the system is usable, debuggable, and safe after the demo energy wears off.
I kept running into the same pattern: the model call itself was rarely the thing making my life miserable. What hurt was the runtime around it getting tangled.
The first shift: make the loop observable
My first versions of the agent loop were heading in a familiar direction. Everything wanted to know what the agent was doing, so everything got wired into the loop itself.
Logging lived there. Approvals lived there. Dashboard updates lived there. Cost tracking lived there. Session persistence lived there.
The result was predictable. The loop got bigger and more fragile. A feature that should have felt adjacent started feeling invasive because adding anything new meant touching the one place that already felt overloaded.
The fix was not especially glamorous. I turned the loop into something that emitted events.
That sounds like a small implementation detail. For me it changed the entire shape of the system.
Once the loop became observable, it stopped needing to know who was listening. The dashboard could subscribe. Session persistence could subscribe. Cost tracking could subscribe. Future features could subscribe without turning loop.ts into a landfill.
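To make that concrete, here is a minimal sketch of an event-emitting loop. It is not Shrimp's actual code: `EventBus`, `AgentEvent`, `runLoop`, and the event names are hypothetical, and the "loop" just replays a list of steps instead of calling a model.

```typescript
// Hypothetical event shapes; a real harness would likely carry more detail.
type AgentEvent =
  | { type: "tool_call"; tool: string }
  | { type: "model_response"; tokens: number }
  | { type: "done" };

type Listener = (event: AgentEvent) => void;

class EventBus {
  private listeners: Listener[] = [];

  // Returns an unsubscribe function so features can detach cleanly.
  subscribe(listener: Listener): () => void {
    this.listeners.push(listener);
    return () => {
      this.listeners = this.listeners.filter((l) => l !== listener);
    };
  }

  emit(event: AgentEvent): void {
    for (const listener of this.listeners) listener(event);
  }
}

// The loop only emits. It has no idea who is listening.
function runLoop(bus: EventBus, steps: AgentEvent[]): void {
  for (const step of steps) bus.emit(step);
  bus.emit({ type: "done" });
}

// Cost tracking and logging live outside the loop as plain subscribers.
const bus = new EventBus();
let totalTokens = 0;
const log: string[] = [];

bus.subscribe((e) => {
  if (e.type === "model_response") totalTokens += e.tokens;
});
bus.subscribe((e) => log.push(e.type));

runLoop(bus, [
  { type: "model_response", tokens: 120 },
  { type: "tool_call", tool: "browser" },
  { type: "model_response", tokens: 80 },
]);
```

Adding a dashboard or session persistence in this shape means one more `subscribe` call, with no change to `runLoop`.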
That separation matters more than people give it credit for. The loop stays small and stable while everything that depends on it keeps changing.
There is also a subtler effect: an observable loop is psychologically easier to work on. When I can see what the system is doing and attach features around it instead of inside it, I stop treating the core loop like a bomb I should not touch. That matters if you want the project to survive beyond the first few clever demos.
One thing I learned the hard way: observability is only real if observers can fail in isolation. If one bad subscriber can crash the loop, your event system is decorative. That sounds mechanical, but it is load-bearing. A dashboard bug should not take down the agent. Neither should a persistence bug or a cost meter bug.
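One way to get that isolation, sketched under assumptions: wrap each subscriber call in its own `try`/`catch` and record the failure instead of rethrowing. `IsolatedBus` and the subscriber names below are hypothetical, and the sketch covers synchronous handlers only. Async handlers would also need their rejected promises caught.

```typescript
type Subscriber<E> = { name: string; handle: (event: E) => void };

class IsolatedBus<E> {
  private subscribers: Subscriber<E>[] = [];
  // Failures are recorded rather than rethrown, so they stay visible.
  readonly failures: { subscriber: string; error: unknown }[] = [];

  subscribe(name: string, handle: (event: E) => void): void {
    this.subscribers.push({ name, handle });
  }

  emit(event: E): void {
    for (const sub of this.subscribers) {
      try {
        sub.handle(event);
      } catch (error) {
        // One bad subscriber must not stop the others, or the loop.
        this.failures.push({ subscriber: sub.name, error });
      }
    }
  }
}

const isolated = new IsolatedBus<string>();
const persisted: string[] = [];

// A dashboard subscriber with a bug in it.
isolated.subscribe("dashboard", () => {
  throw new Error("render failed");
});
// Persistence is registered after the broken subscriber and still runs.
isolated.subscribe("persistence", (e) => persisted.push(e));

isolated.emit("tool_call");
isolated.emit("done");
```

Here the dashboard throws on every event, yet persistence still sees both events and `failures` holds a record of what went wrong and where.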
The second shift: memory should include methods, not just facts
Most conversations about memory in agent systems focus on facts.
Facts about the user. Facts from documents. Facts from prior sessions.
That stuff matters. I am not arguing against it.
But building Shrimp pushed me toward a different kind of memory that I think is under-discussed: memory the system writes about its own behavior.
The version I have right now is simple. Honestly, embarrassingly simple.
If the agent finishes something that involved a multi-step tool sequence, it can write down the rough procedure it used. Not the final answer. The method.
Something like:
"Last time you asked me something like this, I used these tools in this order."
That sounds obvious once you say it plainly, but it felt different when I saw it happen in practice.
The useful thing was not that the system remembered a fact I had fed it. The useful thing was that the harness remembered a pattern it had generated itself.
That is a different class of help.
It is also closer to how a good runtime should compound. Not by pretending the model got smarter overnight, but by making the surrounding system less likely to flail the next time it sees a familiar shape.
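As a rough illustration of writing down the method rather than the answer, a procedure record could be as small as the sketch below. The shapes and names (`ToolStep`, `Procedure`, `recordProcedure`, `describeProcedure`) are assumptions for the example, not Shrimp's schema.

```typescript
// Hypothetical record shapes for a remembered procedure.
interface ToolStep {
  tool: string;
  summary: string;
}

interface Procedure {
  task: string;
  steps: ToolStep[];
  recordedAt: number;
}

const procedures: Procedure[] = [];

// Only multi-step tool sequences are worth writing down as a method.
function recordProcedure(
  task: string,
  steps: ToolStep[],
  now: number = Date.now()
): Procedure | null {
  if (steps.length < 2) return null;
  const procedure: Procedure = { task, steps: steps.slice(), recordedAt: now };
  procedures.push(procedure);
  return procedure;
}

// Renders the note the agent could be shown the next time a similar task arrives.
function describeProcedure(p: Procedure): string {
  const order = p.steps.map((s, i) => `${i + 1}. ${s.tool}`).join(", ");
  return `Last time you asked me something like "${p.task}", I used these tools in this order: ${order}`;
}
```

Calling `recordProcedure` after a task that used `search` and then `browser` stores the sequence, and `describeProcedure` turns it back into a sentence like the one quoted earlier.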
The current implementation is still rough. The matching is dumb. Retrieval can be noisy. There has to be a way to forget bad procedures, or you just end up with a memory system that poisons itself. But even the crude version changed how I think about where leverage lives.
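A sketch of what dumb matching plus forgetting might look like, under stated assumptions: word overlap as the similarity score, a threshold to cut noise, and eviction once a procedure has failed more than it has helped. `ProcedureMemory`, `overlap`, and the thresholds are illustrative choices, not how Shrimp does it.

```typescript
interface StoredProcedure {
  task: string;
  tools: string[];
  successes: number;
  failures: number;
}

function tokens(text: string): string[] {
  return text.toLowerCase().split(/\W+/).filter((w) => w.length > 2);
}

// Deliberately dumb matching: fraction of query words that appear in the stored task.
function overlap(query: string, task: string): number {
  const q = tokens(query);
  const t = tokens(task);
  if (q.length === 0) return 0;
  return q.filter((w) => t.indexOf(w) !== -1).length / q.length;
}

class ProcedureMemory {
  private items: StoredProcedure[] = [];

  add(task: string, tools: string[]): StoredProcedure {
    const item: StoredProcedure = { task, tools, successes: 0, failures: 0 };
    this.items.push(item);
    return item;
  }

  // Returns the best match at or above the threshold, or undefined to limit noise.
  recall(query: string, minScore = 0.5): StoredProcedure | undefined {
    let best: StoredProcedure | undefined;
    let bestScore = 0;
    for (const item of this.items) {
      const score = overlap(query, item.task);
      if (score >= minScore && score > bestScore) {
        best = item;
        bestScore = score;
      }
    }
    return best;
  }

  // Feedback hook: forget a procedure once failures outnumber successes by the limit.
  report(item: StoredProcedure, ok: boolean, maxNetFailures = 2): void {
    if (ok) item.successes += 1;
    else item.failures += 1;
    if (item.failures - item.successes >= maxNetFailures) {
      this.items = this.items.filter((i) => i !== item);
    }
  }

  size(): number {
    return this.items.length;
  }
}

const memory = new ProcedureMemory();
memory.add("summarize the release notes", ["search", "browser"]);
```

With this in place, `memory.recall("summarize release notes for v2")` finds the stored procedure, an unrelated query returns `undefined`, and two failed reports in a row evict the entry.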
People spend a lot of time asking how smart the brain is.
I am more interested in how much the body can learn from its own behavior.
This also changes how I think about “harness thickness”
A lot of agent-framework discussion reduces to a familiar choice:
- keep the harness thin and trust the model
- put more explicit logic into the harness
That framing is useful, but it misses a third shape I keep seeing in practice.
If the runtime is observable and can accumulate procedural memory, you do not have to decide all of the thickness up front.
You can keep the core loop relatively small. You can add surrounding behavior through subscriptions instead of invasive rewrites. You can let repeated work teach the system something about method.
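Those pieces compose: procedural memory can itself be just another subscriber. The sketch below assumes hypothetical event names (`task_start`, `tool_call`, `task_done`) and a hypothetical `procedureRecorder` factory. The emitting side never learns that memory exists.

```typescript
type LoopEvent =
  | { type: "task_start"; task: string }
  | { type: "tool_call"; tool: string }
  | { type: "task_done"; ok: boolean };

type LoopListener = (event: LoopEvent) => void;

// A subscriber factory: it watches events and writes methods into `store`.
function procedureRecorder(store: { task: string; tools: string[] }[]): LoopListener {
  let task = "";
  let tools: string[] = [];
  return (event) => {
    switch (event.type) {
      case "task_start":
        task = event.task;
        tools = [];
        break;
      case "tool_call":
        tools.push(event.tool);
        break;
      case "task_done":
        // Keep only successful multi-step runs.
        if (event.ok && tools.length >= 2) store.push({ task, tools });
        break;
    }
  };
}

const learned: { task: string; tools: string[] }[] = [];
const listeners: LoopListener[] = [procedureRecorder(learned)];
const emitAll = (event: LoopEvent): void => listeners.forEach((l) => l(event));

emitAll({ type: "task_start", task: "check the deploy status" });
emitAll({ type: "tool_call", tool: "shell" });
emitAll({ type: "tool_call", tool: "browser" });
emitAll({ type: "task_done", ok: true });
```

The thickness lives in the list of listeners, which can grow or shrink without the emitting side changing.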
The system still gets more capable, but not because you stuffed every possible rule into the loop on day one.
That feels like a healthier direction to me than either extreme.
What I wish I had internalized earlier
I would have spent less time obsessing over prompts and more time cleaning up the runtime.
I would have added an event bus earlier.
I would have treated failure isolation as a core requirement instead of something to patch in once the dashboard or memory layer broke.
And I would have taken procedural memory more seriously sooner, even in its dumb form.
Not because the current implementation is magic. It is not.
But because it points at something that feels more durable than the usual cycle of prompt tweaks and framework debates.
The interesting question for me is no longer just "how smart is the model?"
It is:
What kind of body have you built around it?
Because that body determines whether the agent is merely impressive in a screenshot or actually useful once it has to live in the world.