What Hiero Taught Me About Production Agents

2025-01-31 · agents, hiero, langgraph, production, web3

Lessons from a year running a multi-chain AI agent terminal across Solana and Base — what works, what doesn't, and where the real engineering effort goes once you hand an agent a wallet, a toolbox, and an audience.

For the past year at Hiero, we've been running a multi-chain AI agent terminal across Solana and Base. Users could spin up token-attached agents and hand them a fairly wide toolbox: token launches, swaps, transfers, on-chain lookups, price research, autonomous Twitter, recurring tasks. We shipped the platform and the HTERM token.

Production agents, it turns out, aren't chatbots with extra tools. They're autonomous systems with budgets, audiences, and the capacity to embarrass you. The bottleneck hasn't been capability. It's been operational discipline.

The premise

The thesis at Hiero was easy to describe and harder to ship: give an autonomous AI agent a wallet, a toolbox, and an audience, and let it do useful things on the user's behalf across blockchains. The toolbox was deliberately broad: DeFi primitives across Solana and Base, token creation and trading, on-chain analytics, deep market research, recurring schedules, social posting. We wanted to find out what was actually possible once an agent had the means.

What I've learned is that "what is possible" was the wrong frame. By any reasonable definition, modern agents could do almost everything we threw at them. The harder question — the one we spent most of our engineering hours on — was always: when an agent does something autonomously with real consequences, what catches it when it goes wrong?

Organising many tools, and the context that flows through them

Hiero exposed a fairly large surface of tools to its agents. Listing them out, the toolbox looks generous; wiring them up to a single agent, you realise you have created a coordination problem the LangChain quickstart does not warn you about.

The first issue is mechanical. Every tool has its own input schema, output schema, side effects, and context requirements. Connecting dozens of tools to one agent means dozens of schemas the model has to reason over, dozens of descriptions taking up the context window, and dozens of places where a misalignment between what a description says and what a tool actually does can produce a bad call. Tool description authoring is its own discipline — verbose enough to be unambiguous, sparse enough to fit, calibrated enough that the model selects correctly.
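One cheap guard against description-versus-behaviour drift is to validate every model-proposed call against the tool's declared parameters before anything executes. The sketch below is illustrative only: `ToolSpec`, `validateCall`, and the `swap` parameters are hypothetical names, not Hiero's schemas or any LangChain API.

```typescript
// Hypothetical, minimal tool spec: a name, a description, and typed parameters.
type ParamType = "string" | "number";
type ToolSpec = {
  name: string;
  description: string;
  params: Record<string, ParamType>;
};

// Returns a list of problems with a proposed call; an empty list means the
// arguments match the declared parameters exactly.
function validateCall(spec: ToolSpec, args: Record<string, unknown>): string[] {
  const errors: string[] = [];
  for (const [key, type] of Object.entries(spec.params)) {
    if (!(key in args)) errors.push(`missing ${key}`);
    else if (typeof args[key] !== type) errors.push(`${key} should be ${type}`);
  }
  for (const key of Object.keys(args)) {
    if (!(key in spec.params)) errors.push(`unexpected ${key}`);
  }
  return errors;
}

// Assumed example tool; the parameter names are made up for illustration.
const swapSpec: ToolSpec = {
  name: "swap",
  description: "Swap an amount of one token for another on the same chain.",
  params: { fromToken: "string", toToken: "string", amount: "number" },
};
```

Feeding the returned errors back to the model as the tool result, rather than throwing, gives it a chance to correct the call on the next step.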

The second issue is structural. Beyond a certain number of tools, a single flat list stops being workable; the model spends too much of its attention on tool selection and not enough on the actual task. We moved to a hierarchical structure — an orchestrator agent delegating to specialist sub-agents, each only seeing the tools relevant to its slice of the problem. That helped, but introduced its own state-management problem, which is the next beat.
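The shape of that hierarchy can be sketched without any framework. This is a simplified illustration, not our production code: `Orchestrator`, `Specialist`, and the two example tools are hypothetical, and a real system would put a model call where `delegate` takes an explicit tool name.

```typescript
// Hypothetical types; the point is the scoping, not the tool bodies.
type Tool = { name: string; description: string; run: (input: string) => string };
type Specialist = { name: string; tools: Tool[] };

class Orchestrator {
  private specialists = new Map<string, Specialist>();

  register(s: Specialist): void {
    this.specialists.set(s.name, s);
  }

  // The tool names one specialist can see: its own slice and nothing else.
  visibleTools(specialist: string): string[] {
    const s = this.specialists.get(specialist);
    return s ? s.tools.map((t) => t.name) : [];
  }

  // Delegate a call; tools outside the specialist's slice are rejected.
  delegate(specialist: string, toolName: string, input: string): string {
    const s = this.specialists.get(specialist);
    if (!s) throw new Error(`unknown specialist: ${specialist}`);
    const tool = s.tools.find((t) => t.name === toolName);
    if (!tool) throw new Error(`${specialist} cannot use ${toolName}`);
    return tool.run(input);
  }
}

const orchestrator = new Orchestrator();
orchestrator.register({
  name: "trading",
  tools: [{ name: "swap", description: "Swap one token for another.", run: (i) => `swapped:${i}` }],
});
orchestrator.register({
  name: "research",
  tools: [{ name: "price_lookup", description: "Fetch a token price.", run: (i) => `price:${i}` }],
});
```

The research specialist never has `swap` in its prompt, so it cannot select it by mistake, and a hallucinated call to it fails at the boundary instead of reaching a wallet.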

The third issue is context bleed: information passed in through one tool's output quietly shaping the agent's reasoning about a different tool, in ways that are difficult to predict and harder to debug. Hard boundaries between tool clusters, and between the sub-agents that own them, were the only thing that kept this manageable. The constraint angle — "what should the agent be unable to do" — sits inside this larger organisation problem, not separate from it.

State across sub-agents and tasks

The agent runtime — the layer that plans, calls tools, tracks state, and stays observable — is where production agents live or die. We built ours on LangChain JS, LangGraph for the multi-step research-and-execution graphs, and LangSmith for observability. The graph pattern that survived was a recursive research-then-act loop with explicit state nodes.

Inside a single LangGraph run, state is well-managed — the graph carries it. The harder problem, in production, is state across sub-agents and across tasks. The moment you decompose a problem into sub-agents — an orchestrator passing work to a specialist — state becomes a translation problem at every handoff. What does the orchestrator pass down? In what schema? At what granularity of summary versus full context? And when the sub-agent finishes, what comes back? We iterated through several patterns and never landed on one that felt completely right.
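To make the handoff question concrete, here is one of the shapes such a contract can take: a lossy summary for orientation, plus a set of pinned facts that are passed verbatim and never summarised. It is a sketch under assumptions, not the pattern we settled on (we did not settle on one). `Handoff`, `HandoffResult`, `buildHandoff`, and `mergeResult` are hypothetical names, and the truncation stands in for a real summariser.

```typescript
// What the orchestrator passes down.
type Handoff = {
  taskId: string;
  goal: string;
  summary: string;               // lossy: orientation only
  facts: Record<string, string>; // exact values that must survive the handoff
};

// What the sub-agent passes back.
type HandoffResult = {
  taskId: string;
  status: "done" | "failed";
  output: string;
  learned: Record<string, string>; // new facts to merge into orchestrator state
};

// Truncation is a placeholder for a real summarisation step.
function buildHandoff(
  taskId: string,
  goal: string,
  context: string[],
  pinned: Record<string, string>,
  maxSummaryChars = 200,
): Handoff {
  const summary = context.join(" ").slice(0, maxSummaryChars);
  return { taskId, goal, summary, facts: { ...pinned } };
}

// Only successful runs are allowed to change orchestrator state.
function mergeResult(
  state: Record<string, string>,
  result: HandoffResult,
): Record<string, string> {
  if (result.status !== "done") return state;
  return { ...state, ...result.learned };
}
```

Separating `facts` from `summary` is the design choice that matters here: addresses, amounts, and identifiers go in `facts`, so a poor summary costs nuance rather than correctness.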

Recurring tasks compound the problem. An agent that runs every hour to check on positions, or every morning to review the previous day's activity, needs to know what previous runs did. Without persistent state across runs, every invocation starts cold. We patched this with database-backed task histories the agent could query — bespoke per task type, never quite right.
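The core of that patch is small. The sketch below uses an in-memory array where we used a database, and every name in it (`RunRecord`, `TaskHistory`, `buildPromptPreamble`) is hypothetical; it shows the idea of giving each run the last few run summaries, not our actual schema.

```typescript
type RunRecord = { taskType: string; ranAt: number; summary: string };

// In-memory stand-in for a database-backed history table.
class TaskHistory {
  private records: RunRecord[] = [];

  append(record: RunRecord): void {
    this.records.push(record);
  }

  // Most recent runs of one task type, newest first.
  recent(taskType: string, limit: number): RunRecord[] {
    return this.records
      .filter((r) => r.taskType === taskType)
      .sort((a, b) => b.ranAt - a.ranAt)
      .slice(0, limit);
  }
}

// Text to prepend to the next run's prompt so it does not start cold.
function buildPromptPreamble(history: TaskHistory, taskType: string, limit = 3): string {
  const runs = history.recent(taskType, limit);
  if (runs.length === 0) return "No previous runs recorded.";
  return runs
    .map((r) => `[${new Date(r.ranAt).toISOString()}] ${r.summary}`)
    .join("\n");
}
```

What goes into `summary` is where the per-task bespoke work lives: a position check and a daily review need to remember different things.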

Long-term memory is the silent failure mode

Beyond the state that flows within a graph or between sub-agents, there is a deeper structural problem: agents need memory that persists across sessions, across runtimes, and ultimately across platforms. None of the production memory systems I know of solve this cleanly.

Every new session, the agent reset. Cross-session continuity was patched by extracting context from previous interactions, summarising it, and injecting it back into the next prompt. It worked, after a fashion. It never felt right. Summaries are lossy. Critical context is, invariably, the part the summariser drops.

Observability is the foundation

A class of question we kept getting from people evaluating Hiero was: "how do you make sure the agent behaves correctly?" The honest answer was always some version of "we cannot, and that is the wrong question."

Agents are non-deterministic. You cannot make them reliable in the way a sorting algorithm is reliable. What you can do is make them inspectable. Every run traced. Every tool call logged. Every state transition recorded. When something goes wrong the question that matters is not "why did the model do that," which is rarely answerable, but "can I see exactly what the agent saw and did, in order, and replay it." That, in practice, is what determines whether you fix the problem before the next user hits it.
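Stripped of tooling, "see what the agent saw and did, in order, and replay it" comes down to recording each tool call's exact input and output, then re-running the recorded inputs and noting where the outputs differ. The sketch below illustrates that idea only. It is not LangSmith's API, and `RunTrace`, `TraceEvent`, and `replay` are hypothetical names.

```typescript
type TraceEvent = { step: number; name: string; input: string; output: string };
type ToolFn = (input: string) => string;

class RunTrace {
  private events: TraceEvent[] = [];

  // Wrap a tool so every call is recorded with its exact input and output.
  wrap(name: string, fn: ToolFn): ToolFn {
    return (input) => {
      const output = fn(input);
      this.events.push({ step: this.events.length, name, input, output });
      return output;
    };
  }

  log(): readonly TraceEvent[] {
    return this.events;
  }
}

// Re-run each recorded call and return the steps whose output no longer
// matches the recording (or whose tool is missing).
function replay(trace: readonly TraceEvent[], tools: Record<string, ToolFn>): number[] {
  const diverged: number[] = [];
  for (const e of trace) {
    const fn = tools[e.name];
    if (!fn || fn(e.input) !== e.output) diverged.push(e.step);
  }
  return diverged;
}
```

Real tools are asynchronous and touch live chain state, so a production replay would stub external calls with the recorded outputs rather than re-execute them. The recording side is the part that carries over unchanged.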

LangSmith earned its keep here. We learnt to debug agent runs the way you debug distributed systems: trace, replay, blame. The teams I've seen struggle hardest with production agents are the ones who tried to engineer determinism into the agent rather than engineer observability around it. Observability isn't a feature you bolt on later. It's the foundation that makes everything else debuggable.

The wallet changes everything

There's a moment, when you grant an agent the ability to sign on-chain transactions, when the operational stakes flip. Pre-wallet, the worst the agent can do is say something wrong. Post-wallet, the worst it can do is lose your money — or your users' money, which is worse.

That moment changes every design decision upstream. Authorisation becomes the design problem, not capability. What is this agent permitted to spend, on what, in what window, and how does the user revoke that permission cleanly? Hiero's answer was scoped per-agent budgets, allow-listed tool surfaces, and human checkpoints above configurable thresholds. It was ad hoc, it was functional, and it was clearly not the long-term answer.
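Those three controls, plus revocation, fit in one guard that sits between the agent and the signer. This is a minimal sketch rather than Hiero's implementation: `Policy`, `SpendGuard`, the unit of `amount`, and the rolling-window behaviour are all assumptions made for illustration.

```typescript
type Policy = {
  allowedTools: Set<string>;
  budget: number;            // max total spend per window (unit assumed, e.g. USD)
  windowMs: number;          // length of the rolling budget window
  approvalThreshold: number; // a single spend above this needs a human
};

type Decision = "allow" | "needs_approval" | "deny";

class SpendGuard {
  private policy: Policy;
  private spends: { at: number; amount: number }[] = [];
  private revoked = false;

  constructor(policy: Policy) {
    this.policy = policy;
  }

  // User-initiated kill switch: everything is denied afterwards.
  revoke(): void {
    this.revoked = true;
  }

  // Decide before signing. `now` is passed in so the logic is testable.
  check(tool: string, amount: number, now: number): Decision {
    if (this.revoked) return "deny";
    if (!this.policy.allowedTools.has(tool)) return "deny";
    const windowStart = now - this.policy.windowMs;
    const spent = this.spends
      .filter((s) => s.at > windowStart)
      .reduce((sum, s) => sum + s.amount, 0);
    if (spent + amount > this.policy.budget) return "deny";
    if (amount > this.policy.approvalThreshold) return "needs_approval";
    return "allow";
  }

  // Call only after the transaction has actually been sent.
  record(amount: number, now: number): void {
    this.spends.push({ at: now, amount });
  }
}
```

The order of checks is deliberate: revocation and the allow-list come first, so a revoked or out-of-scope call is denied outright and never reaches a human approval queue.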

The long-term answer probably looks like the patterns DeFi has been hardening for half a decade — approve and allowance semantics, smart-contract wallet policies, session keys, account abstraction. Those primitives map almost directly onto the agent-spending problem.
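To show how directly the mapping works, here is approve-and-allowance semantics with a session-key style expiry, modelled in memory. It is an illustration of the semantics only, not a smart contract and not any wallet SDK; `AllowanceLedger` and its methods are hypothetical.

```typescript
// In-memory model: the user approves an agent for a capped amount until a deadline.
class AllowanceLedger {
  private allowances = new Map<string, { remaining: number; expiresAt: number }>();

  // Grant (or overwrite) an agent's allowance. Approving 0 revokes it.
  approve(agentId: string, amount: number, expiresAt: number): void {
    this.allowances.set(agentId, { remaining: amount, expiresAt });
  }

  // Remaining spendable amount; zero once the grant has expired.
  allowance(agentId: string, now: number): number {
    const a = this.allowances.get(agentId);
    return a && now < a.expiresAt ? a.remaining : 0;
  }

  // Decrements and returns true on success; a failed spend changes nothing.
  spend(agentId: string, amount: number, now: number): boolean {
    if (amount <= 0 || amount > this.allowance(agentId, now)) return false;
    const a = this.allowances.get(agentId);
    if (!a) return false;
    a.remaining -= amount;
    return true;
  }
}
```

On-chain, the same rules would be enforced by the token or wallet contract rather than by application code, which is the reason to prefer these primitives over bespoke checkpoints.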

What I'd do differently

If I were starting Hiero again, the order would be different.

Wallet authorisation first — before any tool, before any agent persona, before any token. Most of what made the day-to-day painful traced back to building features against a wallet that didn't have proper policy primitives, then patching the gap with bespoke checkpoints. The pattern is hard to undo once it's there.

Observability right alongside it — really part of the same beat. LangSmith from day one is the one decision I wouldn't change. Trace everything. Make every run replayable. Without that, nothing else is debuggable.

Memory third, but with a different mental model than the one we started with: as a structural property of the runtime, not a feature you retrofit. Persistent, portable, agent-controlled state needs to be there before the first tool ships, not bolted on six months in.

Three orderings, really one principle: think about what catches the agent before you think about what lets it act.

Looking back

Hiero was a great engineering challenge. The operational discipline required to ship autonomous systems with real consequences is its own thing, and we were learning it as we built. I'm excited about how this field is evolving, and I'm looking forward to learning more about the craft of building agents.