In 2014, everyone was breaking their monoliths apart. "Just decompose the domain, give each service its own database, and you're done." The pitch was clean. The reality took the better part of a decade to stabilize — distributed tracing, service meshes, circuit breakers, saga patterns, idempotency keys. The graveyard of production incidents from those early Netflix and Uber war stories is long.
We are now at the exact same inflection point with multi-agent AI systems, and almost nobody is treating it with the same level of architectural seriousness. Everyone is excited about what agents can do. Almost nobody is thinking carefully about what happens when they fail.
I've spent the last eighteen months building multi-agent pipelines in production — systems where a dozen or more LLM-backed agents coordinate to accomplish tasks that no single model call could handle cleanly. Research pipelines, code generation workflows, autonomous data analysis systems. And I'm here to tell you: the distribution tax is real, it applies to agents, and the bill comes due in production.
The problems that killed your microservices aren't behind you. They're waiting for you at a new address.
The Structural Parallel Is Almost Exact
Let's be precise about the analogy because precision matters here. In a microservices architecture, you decompose a monolithic application into a set of independently deployable services. Each service has a bounded context, its own data store, and communicates with other services over a network boundary. The coordination overhead of that network boundary is the price you pay for the modularity and scalability you gain.
In a multi-agent system, you decompose a complex reasoning task into a set of independently operating agents. Each agent has a specialized capability — a research agent that browses the web, a synthesis agent that summarizes information, a critique agent that challenges assumptions, an executor agent that writes and runs code. They communicate by passing messages, artifacts, and shared state. The coordination overhead of that reasoning boundary is the price you pay for the parallelism and specialization you gain.
The shape of the problem is identical. You have autonomous units. You have communication between them. You have partial failures. You have state that needs to be consistent. You have latency that compounds. The solutions the distributed systems community developed over years for microservices are not perfectly portable, but they are the right conceptual starting point.
Orchestration vs. Choreography: The First Decision That Matters
This is the first architectural fork in the road for any multi-agent system, and I've seen teams get it wrong in both directions.
The Orchestration Model
In an orchestrated system, there is a central coordinator — often called the "planner" or "supervisor" agent — that explicitly directs the other agents. It knows the full plan upfront, assigns tasks, receives results, and decides what comes next. Think of it as a conductor. Nothing happens without the conductor's explicit instruction.
# Orchestrated agent loop (pseudocode)
class OrchestratorAgent:
    def run(self, goal: str) -> Result:
        plan = self.llm.plan(goal)
        for step in plan.steps:
            agent = self.agent_registry.get(step.agent_type)
            result = agent.execute(step.task, context=self.shared_context)
            if result.failed:
                revised_plan = self.llm.replan(goal, failed_step=step, error=result.error)
                return self.run_from(revised_plan, step.index)
            self.shared_context.update(result.artifacts)
        return self.synthesize(self.shared_context)
Orchestration is predictable. It's easy to trace. It's easy to debug because you have a single point of coordination and a clear execution log. The failure mode is brittleness — if the orchestrator's plan is wrong, the whole system is wrong, and replanning under failure is expensive and often unreliable. The orchestrator also becomes a single point of failure and a latency bottleneck.
The Choreography Model
In a choreographed system, there is no central coordinator. Agents subscribe to events or topics, react to outputs from other agents, and publish their own outputs. The overall behavior of the system emerges from the rules each individual agent follows. Think of it as a jazz ensemble — nobody is in charge, but if everyone knows the structure, it works.
# Choreographed agent event loop (pseudocode)
class ResearchAgent:
    subscribes_to = ["task.research_requested"]
    publishes_to = ["task.research_complete"]

    def on_event(self, event: Event):
        if event.type == "task.research_requested":
            findings = self.search_and_summarize(event.payload.query)
            self.publish("task.research_complete", {
                "task_id": event.payload.task_id,
                "findings": findings,
                "confidence": findings.confidence_score
            })

class SynthesisAgent:
    subscribes_to = ["task.research_complete", "task.critique_complete"]

    def on_event(self, event: Event):
        # accumulate inputs, synthesize when quorum is reached
        self.buffer[event.payload.task_id].append(event)
        if self.has_quorum(event.payload.task_id):
            self.synthesize_and_publish(event.payload.task_id)
Choreography scales better. Agents can run in parallel without contending on a central coordinator. The system is more resilient because there is no single point of failure. The failure mode is opacity — when something goes wrong, reconstructing what actually happened from the event stream is genuinely hard. And emergent behavior in LLM-based agents under edge cases is unpredictable in ways that Java microservices were not.
My practical recommendation: start with orchestration for anything that needs to be debugged and explained to stakeholders. Move toward choreography for well-understood, high-throughput workloads where the agents are more narrowly scoped. Mixing both — a choreographed mesh of orchestrated sub-pipelines — is where mature systems end up.
Agent Communication Patterns
The network boundary in microservices forced you to make a decision: synchronous HTTP or asynchronous messaging. The same decision exists for agents, and the tradeoffs are the same.
Synchronous Tool Calls
Most developer tooling today defaults to synchronous agent-to-agent calls. Agent A calls Agent B and waits for the response. This is simple to implement and easy to reason about. It is also the pattern most likely to cause cascading latency failures — if Agent B is slow or degraded, Agent A is blocked, and anything waiting on Agent A is blocked too. You have just reinvented the synchronous HTTP call chain that distributed systems engineers spent years learning to break apart.
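If you must make synchronous agent-to-agent calls, at minimum bound them. A minimal sketch, assuming hypothetical `agent_fn` and `fallback` callables standing in for real agent clients, that prevents a slow callee from blocking the caller indefinitely:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def call_with_timeout(agent_fn, payload, timeout_s=10.0, fallback=None):
    """Bound a synchronous agent call so a slow or degraded callee
    cannot block the caller (and everything behind it) forever."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(agent_fn, payload)
        try:
            return future.result(timeout=timeout_s)
        except TimeoutError:
            future.cancel()  # best effort; a running call cannot be cancelled
            if fallback is not None:
                return fallback(payload)
            raise
```

This is a stopgap, not a fix: the timeout stops the cascade from propagating upward, but the real solution is usually moving the interaction to asynchronous messaging.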
Shared State via a Context Object
A common pattern in multi-agent frameworks is a shared "context" or "scratchpad" — a mutable object that all agents can read from and write to. This sounds convenient. It is an anti-pattern at scale. Shared mutable state in a distributed system is why we invented CRDTs and eventual consistency. A shared context object across agents that run in parallel gives you race conditions, stale reads, and conflicting writes. I've watched production agent systems produce contradictory final outputs because two agents simultaneously updated the same context key with different values and neither knew about the conflict.
# What this looks like going wrong:
# Agent A reads context.findings → empty, starts fresh research
# Agent B reads context.findings → empty, starts fresh research
# Agent A writes context.findings → ["result_a"]
# Agent B writes context.findings → ["result_b"] (overwrites A)
# Synthesizer reads context.findings → ["result_b"] (Agent A's work is gone)
# What you actually want:
context.findings.append_with_source(agent_id="agent_a", data=result_a)
context.findings.append_with_source(agent_id="agent_b", data=result_b)
# Synthesizer receives both, attributed, and can resolve conflicts explicitly
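The attributed-append pattern can be sketched as an append-only context object. The class and method names here are illustrative, not a real framework API; the point is that parallel writes accumulate with attribution instead of overwriting each other:

```python
import threading
from collections import defaultdict

class AttributedContext:
    """Append-only shared context: every write is attributed to the
    agent that produced it, so parallel writes accumulate instead of
    silently overwriting. Conflict resolution becomes an explicit
    downstream step rather than a race."""
    def __init__(self):
        self._lock = threading.Lock()
        self._entries = defaultdict(list)  # key -> [(agent_id, data), ...]

    def append_with_source(self, key, agent_id, data):
        with self._lock:
            self._entries[key].append((agent_id, data))

    def read(self, key):
        # Returns all attributed entries; the reader decides how to
        # merge or arbitrate between conflicting sources.
        with self._lock:
            return list(self._entries[key])
```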
Message Queues and Event Buses
The robust pattern, particularly for choreographed systems, is an explicit message bus. Agents publish to named topics, subscribe to others, and the bus handles delivery guarantees, ordering, and backpressure. This is boring infrastructure. It is also correct. Use Redis Streams, Kafka, or even a simple Postgres-backed queue, depending on your scale requirements. The bus gives you an audit log of every inter-agent communication for free, which matters enormously for debugging.
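A minimal in-process sketch of the pattern, just to make the shape concrete (a production system would back this with Redis Streams, Kafka, or Postgres; the audit log here is the part that earns its keep):

```python
from collections import defaultdict

class EventBus:
    """Minimal in-process pub/sub bus. Every published event is
    recorded in an audit log before delivery, giving you a complete
    trace of inter-agent communication for free."""
    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of handlers
        self.audit_log = []                    # (topic, payload) for every publish

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, payload):
        self.audit_log.append((topic, payload))
        for handler in self._subscribers[topic]:
            handler(payload)
```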
Failure Modes Nobody Warned You About
Microservices gave you well-understood failure modes: network partitions, timeout cascades, data inconsistency. Multi-agent systems have all of those plus a new class of failures that are unique to reasoning systems.
Hallucination Propagation
This is the failure mode that keeps me up at night. In a microservice, if Service A returns bad data, Service B either rejects it with a validation error or propagates it unchanged. The error is usually detectable. In a multi-agent system, if Agent A returns a confident-sounding hallucination, Agent B will reason about it as if it were true, produce a more elaborate hallucination that builds on the first, and pass that downstream. By the time the output reaches the user, you have a plausible-sounding confabulation that traces back to a single bad LLM call three steps earlier. I call this hallucination compounding, and it is the most dangerous failure mode in long-horizon agent pipelines.
The mitigation is explicit confidence gating — every agent output should carry a structured confidence signal, and downstream agents should have hard-coded skepticism thresholds that trigger verification steps rather than blind acceptance.
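A confidence gate can be sketched in a few lines. The threshold value and the `verify_fn` hook (standing in for a verification agent or fact-checking call) are illustrative assumptions; the key design choice is that a missing confidence score is treated as zero trust, not as a pass:

```python
def gate_output(output, threshold=0.7, verify_fn=None):
    """Confidence gate: accept high-confidence outputs, route
    low-confidence ones through an explicit verification step
    instead of passing them downstream blindly."""
    confidence = output.get("confidence", 0.0)  # missing score = zero trust
    if confidence >= threshold:
        return output
    if verify_fn is None:
        raise ValueError(f"low-confidence output ({confidence}) with no verifier")
    return verify_fn(output)
```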
Goal Drift
In long-running pipelines, agents can drift from the original goal in ways that are individually rational but collectively wrong. A research agent, faced with a difficult search query, may reasonably expand the scope to find adjacent information. A synthesis agent, receiving that expanded information, may reasonably shift the framing of the summary. After three or four hops, the output addresses a subtly different question than the one originally asked. No single agent made an error. The system as a whole failed.
The fix is goal anchoring — every agent should receive not just the immediate task, but a read-only canonical copy of the original user goal, and should be prompted to evaluate its output against that goal before emitting it.
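One way to wire goal anchoring into prompt construction, sketched below. The prompt wording is illustrative, not a prescription; what matters is that the original goal travels with every task and that the alignment check is an explicit instruction, not an afterthought:

```python
def build_agent_prompt(task: str, goal_anchor: str) -> str:
    """Attach a read-only copy of the original user goal to every
    agent prompt, and instruct the agent to check its output against
    that goal before emitting it."""
    return (
        f"ORIGINAL USER GOAL (read-only, do not reinterpret):\n{goal_anchor}\n\n"
        f"YOUR IMMEDIATE TASK:\n{task}\n\n"
        "Before emitting your output, state whether it still serves the "
        "ORIGINAL USER GOAL. If it does not, narrow your output until it does."
    )
```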
Token Budget Exhaustion Mid-Pipeline
This one sounds mundane but is surprisingly common. A long-running pipeline that looked fine in testing hits production data that is more verbose than expected. Intermediate agents accumulate context. By step seven of a twelve-step pipeline, you are hitting context window limits, and the agent's behavior degrades silently — it starts truncating, summarizing aggressively, or simply refusing to reason about the full context. You get a result, but it is a result produced under constraints you didn't account for.
Treat token budgets as first-class infrastructure. Track them. Alert on them. Architect your context-passing to be selective rather than cumulative by default.
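Treating the budget as first-class infrastructure can be as simple as the sketch below. The limit, alert ratio, and exception type are illustrative choices; the design point is that exhaustion fails loudly mid-pipeline instead of degrading silently:

```python
class TokenBudget:
    """Per-pipeline token budget: track consumption, signal an alert
    at a soft threshold, and raise hard at the limit rather than
    letting agents degrade silently."""
    def __init__(self, limit: int, alert_ratio: float = 0.8):
        self.limit = limit
        self.alert_ratio = alert_ratio
        self.used = 0

    def charge(self, tokens: int) -> bool:
        """Record usage. Returns True when the alert threshold is
        crossed; raises when the hard limit is exceeded."""
        self.used += tokens
        if self.used > self.limit:
            raise RuntimeError(f"token budget exhausted: {self.used}/{self.limit}")
        return self.used >= self.limit * self.alert_ratio
```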
Observability: The Hardest Part
Distributed tracing for microservices was a solved problem by 2018. OpenTelemetry gave you spans, traces, and structured logs that you could feed into Jaeger or Datadog. You could look at a trace and say definitively: this request spent 240ms in the payment service and 80ms in the inventory service.
Observability for multi-agent systems is not a solved problem. The challenge is that the interesting failures are semantic, not just operational. Latency and error rates are necessary metrics. They are not sufficient. You also need to know:
- Did Agent A's reasoning actually address the right sub-problem?
- Did the confidence scores reported by each agent reflect genuine epistemic state or were they defaulted?
- Where in the pipeline did the hallucination originate?
- Did goal drift occur, and at which agent?
- Was a particular output the product of actual reasoning or of a cached/shortcut path?
None of these are answerable from a span duration. You need structured intermediate outputs — a trace not just of time, but of reasoning. Every agent call should log the full prompt, the model response, the parsed structured output, and any confidence metadata in a queryable store. This is expensive in storage. It is less expensive than debugging a production incident where you have no idea what your agents were actually thinking.
# Agent trace record schema (pseudocode)
{
    "trace_id": "uuid-v4",
    "span_id": "uuid-v4",
    "parent_span_id": "uuid-v4 | null",
    "agent_id": "research_agent_v2",
    "task": "find recent papers on transformer attention efficiency",
    "goal_anchor": "summarize the state of LLM efficiency research for a CTO",
    "model": "claude-sonnet-4-6",
    "prompt_tokens": 1842,
    "completion_tokens": 634,
    "latency_ms": 3241,
    "output": {
        "findings": [...],
        "confidence": 0.82,
        "sources_verified": true,
        "goal_alignment_check": "PASS"
    },
    "raw_completion": "...",  // full LLM output, for post-hoc debugging
    "timestamp": "2026-02-07T14:23:01Z"
}
The Agent Mesh Concept
In the microservices world, a service mesh (Istio, Linkerd) sits as a transparent infrastructure layer between services. It handles mTLS, load balancing, circuit breaking, and retry logic — without any individual service needing to implement those concerns. The services communicate as if they are talking directly to each other. The mesh handles everything else.
The same pattern is emerging for multi-agent systems. I think of it as the agent mesh — a transparent coordination layer that sits between agents and handles the cross-cutting concerns that you don't want to bake into each agent's implementation:
- Routing: Directing agent outputs to the right downstream consumers based on content type, confidence level, or task classification — without the emitting agent needing to know who the consumers are.
- Circuit breaking: If a specialized agent is failing consistently, stop calling it and fall back to a generalist path rather than compounding failures downstream.
- Rate limiting and cost control: LLM calls are expensive. An agent mesh can enforce per-agent token budgets, rate limits, and cost ceilings without polluting agent logic with billing concerns.
- Semantic deduplication: In parallel agent architectures, two agents often produce functionally identical outputs via different reasoning paths. A mesh layer can deduplicate before passing downstream, reducing noise in synthesis steps.
- Retry and fallback: Automated retry with model fallback (e.g., retry a failed `claude-opus-4-6` call with `claude-sonnet-4-6` at lower cost) is best handled at the mesh layer, not inside each agent.
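The circuit-breaking concern above can be sketched as a small mesh-layer wrapper. The agent callables, failure threshold, and reset policy here are illustrative assumptions, not a real mesh API:

```python
class CircuitBreaker:
    """Mesh-layer circuit breaker: after `max_failures` consecutive
    failures of a specialized agent, route calls straight to a
    generalist fallback instead of compounding failures downstream.
    A success closes the circuit again."""
    def __init__(self, agent_fn, fallback_fn, max_failures=3):
        self.agent_fn = agent_fn
        self.fallback_fn = fallback_fn
        self.max_failures = max_failures
        self.failures = 0

    def call(self, payload):
        if self.failures >= self.max_failures:
            return self.fallback_fn(payload)   # circuit open: skip the agent
        try:
            result = self.agent_fn(payload)
            self.failures = 0                  # success resets the count
            return result
        except Exception:
            self.failures += 1
            return self.fallback_fn(payload)
```

A production version would also add a cool-down after which the circuit half-opens and probes the specialized agent again.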
We don't yet have an Istio for agents. The tooling is catching up. But the architectural concept is sound and the teams I've seen succeed at scale are building this layer explicitly — even if it is custom code — rather than letting each agent accumulate its own ad hoc resilience logic.
Production Deployment Patterns
Shipping a single LLM call to production is comparatively easy. You add a retry, you add a timeout, you log the response. Shipping a multi-agent pipeline to production requires you to think about deployment at the system level, not the agent level.
Canary Releases for Agent Pipelines
When you update a single agent in a multi-agent pipeline, you are not updating a microservice in isolation. You are changing a component of a coupled reasoning system. An agent that behaves identically in unit testing may behave differently in the context of a real pipeline, because its behavior is conditioned on the outputs of upstream agents. Canary releases — sending a small percentage of traffic through the updated pipeline while the rest uses the stable version — are essential, and they need to evaluate semantic quality, not just error rates. Ship your agent updates with automated evals that run on real production traffic samples before broadening rollout.
Stateless vs. Stateful Agents
The stateless service ideal of microservices — any instance can handle any request, no local state — is a good default for agents too. Agents that accumulate state across calls become harder to scale, harder to restart after failure, and harder to reason about. Where state is necessary — a long-running workflow that spans hours — externalize it explicitly into a durable state store (Postgres, Redis, a workflow engine) rather than keeping it in-process. The agent should be able to reconstruct its working context entirely from the external store on any restart.
Idempotency
LLM calls are not idempotent by nature — running the same prompt twice often produces different outputs. But your agent actions — writes to databases, calls to external APIs, emails sent — absolutely must be idempotent. Every action taken by an agent should carry a stable idempotency key derived from the task context and the agent's step in the pipeline. If the pipeline retries, the action is a no-op rather than a duplicate side effect. This is basic distributed systems hygiene that agent frameworks often omit entirely.
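A minimal sketch of key derivation and an action ledger, assuming hypothetical `pipeline_id`/`step_index` fields for the task context; the hashing scheme is one reasonable choice, not a standard:

```python
import hashlib

def idempotency_key(pipeline_id: str, step_index: int, task_payload: str) -> str:
    """Derive a stable key from the task context and the agent's
    position in the pipeline: retries of the same step always map
    to the same key."""
    material = f"{pipeline_id}:{step_index}:{task_payload}"
    return hashlib.sha256(material.encode()).hexdigest()

def execute_once(key: str, action, ledger: dict):
    """Run the side-effecting action only if this key has not been
    seen before; otherwise return the recorded result as a no-op."""
    if key in ledger:
        return ledger[key]
    ledger[key] = action()
    return ledger[key]
```

In production the ledger would be a durable store (e.g. a unique-keyed Postgres table), not an in-memory dict.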
The Human-in-the-Loop Checkpoint
Not every step in a multi-agent pipeline should run fully autonomously in production. For high-stakes decisions — financial actions, external communications, irreversible data mutations — insert explicit human approval checkpoints into the pipeline design. This is not a failure of the agentic vision. It is sensible risk management that mirrors the approval workflows you already have in your non-AI systems. Design the checkpoint as a first-class step in the pipeline, not an afterthought. Agents should be able to emit a structured "requires approval" output that suspends the pipeline and routes to a human review queue.
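The "requires approval" output can be sketched as a structured record that suspends the pipeline. Field names and the resume-token scheme are illustrative:

```python
def require_approval(action_type: str, payload: dict, review_queue: list) -> dict:
    """Instead of executing a high-stakes action, emit a structured
    'requires approval' record that suspends the pipeline and lands
    in a human review queue. The pipeline resumes only when a
    reviewer releases the resume token."""
    record = {
        "status": "requires_approval",
        "action_type": action_type,
        "payload": payload,
        "resume_token": f"resume-{len(review_queue)}",  # placeholder token scheme
    }
    review_queue.append(record)
    return record
```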
Where This Is All Going
Microservices took roughly a decade to mature — from the initial wave of decomposition in 2012-2015, through the chaos of poorly-instrumented distributed systems, to the relative stability of the service mesh era by 2020-2022. The tooling eventually caught up to the architectural ambitions.
Multi-agent systems are compressing that timeline, but they are not skipping it. The teams that are building multi-agent infrastructure seriously right now — investing in agent tracing, building coordination layers, thinking hard about failure modes — are going to look like the engineering orgs that invested in Kubernetes in 2016. The teams that are shipping spaghetti agent pipelines with no observability and no failure handling are going to look like the teams that had 50 microservices and no distributed tracing in 2017. The reckoning comes later, not never.
The good news is the conceptual work is already done. Distributed systems is a mature field. The patterns — circuit breakers, bulkheads, event sourcing, sagas, idempotency, canary releases — are well understood and well documented. The translation from microservices to multi-agent systems is not trivial, but it is tractable. You do not need to learn new principles. You need to apply existing principles to a new substrate.
Start there. Take your systems design knowledge seriously. Build the boring infrastructure. Your agents will be smarter for it.