Multi-agent orchestration is the discipline of coordinating two or more semi-autonomous LLM agents — each with its own prompt, tools, and sometimes its own model — so they collaborate on a task that one agent could not reliably finish alone. You need it only when the work has genuinely separable sub-problems, parallelizable branches, or competing roles that benefit from isolation; for most workloads the honest answer is that you do not, and a single agent with a good tool set is the better build.

That deserves saying plainly, because the common failure here is not "I picked the wrong topology" — it is "I built a multi-agent system at all." Anthropic's engineering writing on building effective agents makes the same case: reach for multiple autonomous agents only when a single LLM call or a simple workflow falls short. This article assumes you have passed that test. We will walk the core patterns — supervisor, swarm, pipeline, blackboard, and routing — then cover coordination, token economics, failure modes, observability, and the build-vs-framework decision. For the groundwork on what an agent is, see our agentic AI pillar.

When Multi-Agent Is Justified — and When It Is Over-Engineering

The justification test is mechanical. A multi-agent system earns its complexity when at least one of these holds:

  1. Context isolation pays off. Sub-tasks need disjoint, large contexts that would blow a single window or poison reasoning if combined (one agent reads contracts, another reads telemetry).
  2. Genuine parallelism exists. Independent branches run concurrently and the wall-clock win matters — map-reduce over many documents.
  3. Roles are adversarial or specialized. You want a generator and a critic, or a planner and an executor, kept apart so one cannot rationalize the other's mistakes.
  4. Tool surfaces are too large for one agent. Hundreds of tools degrade selection accuracy; partitioning them across role-scoped agents restores it. (An agent gateway with semantic tool discovery is the other lever.)

If none of those hold, you are over-engineering. The tells are familiar: a "researcher agent" and a "writer agent" that always run in fixed sequence and never branch (that is a pipeline, often just two prompts); five agents that each make one tool call and hand back. Each boundary costs latency, tokens, and a new place for a run to get stuck. Multi-agent is a tax you pay for isolation and parallelism, not for org-chart aesthetics.

The Core Multi-Agent Orchestration Patterns

Five patterns cover almost everything; the comparison table that follows summarizes them.

Supervisor / hierarchical (orchestrator-worker)

One orchestrator agent is in charge: it decomposes the task, dispatches sub-tasks to specialized workers, and synthesizes their outputs into a final answer. Workers do not talk to each other; coordination flows through the center. This is the topology Anthropic described in its multi-agent research system, where a lead agent spawns subagents to explore a query in parallel, then composes the result. Topology: star/tree, control and state centralized.

Use when the task decomposes into independent sub-tasks and you want parallel fan-out with one synthesis point and one place to enforce policy, budget, and stopping conditions; it is the most predictable and observable dynamic pattern. Avoid when sub-tasks are tightly interdependent and must negotiate mid-flight (the supervisor becomes a chatty relay), or when the supervisor's own context would balloon from holding every worker's output for synthesis. You can nest supervisors, but resist going past two tiers: each layer multiplies cost and makes failure attribution harder.
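
The star topology can be sketched in plain Python. This is a minimal illustration, not any framework's API: `decompose`, `stub_worker`, and `synthesize` are hypothetical stand-ins for what would be model calls in a real system, and `max_fanout` is an assumed policy knob.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def decompose(task: str) -> List[str]:
    """Stand-in for the orchestrator's planning call: split the task on ';'."""
    return [part.strip() for part in task.split(";") if part.strip()]


def stub_worker(subtask: str) -> str:
    """Stand-in for a worker agent; a real one would call a model with its own prompt and tools."""
    return f"[{subtask}] done"


def synthesize(results: List[str]) -> str:
    """Stand-in for the synthesis call: join worker outputs in dispatch order."""
    return " | ".join(results)


def run_supervisor(task: str,
                   worker: Callable[[str], str] = stub_worker,
                   max_fanout: int = 4) -> str:
    subtasks = decompose(task)
    if not subtasks:
        raise ValueError("nothing to dispatch")
    # The supervisor is the single place where the fan-out cap is enforced.
    if len(subtasks) > max_fanout:
        raise ValueError(f"fan-out {len(subtasks)} exceeds cap {max_fanout}")
    # Workers run in parallel and never see each other's output.
    with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
        results = list(pool.map(worker, subtasks))  # map() preserves input order
    return synthesize(results)


answer = run_supervisor("read contracts; read telemetry; check pricing")
```

Because every result flows back through `run_supervisor`, budget checks, logging, and stopping rules all have one obvious home, which is the main reason this pattern is easy to observe.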

Swarm / peer hand-off

There is no central boss. Agents are peers, and control transfers by hand-off: the active agent decides another is better suited and passes the conversation — accumulated context included — to it. OpenAI's Swarm experiment and its successor, the OpenAI Agents SDK, popularized this, where a hand-off is literally a tool the agent calls that returns the next agent to run. Topology: decentralized graph; control is a moving token.

Use when the right specialist is best decided in-flow (a triage agent routing to billing vs. technical vs. refunds), or you want each agent to carry a tight, role-specific prompt and delegate when the topic shifts. Avoid when you need global coordination, strict ordering, or a guaranteed single synthesis step — swarms make ownership of the final answer ambiguous and are the pattern most prone to infinite hand-off loops unless you bound them.
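
A hand-off loop can be sketched without any SDK. The code below is a plain-Python stand-in, not the Swarm or Agents SDK API: each hypothetical handler returns a reply plus the name of the next agent (or `None` to finish), and `max_handoffs` is an assumed bound.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A handler sees the conversation so far and returns (reply, next_agent_or_None).
Handler = Callable[[List[str]], Tuple[str, Optional[str]]]


def triage(history: List[str]) -> Tuple[str, Optional[str]]:
    text = history[0].lower()  # the original user message
    if "refund" in text:
        return "Passing you to refunds.", "refunds"
    if "invoice" in text:
        return "Passing you to billing.", "billing"
    return "How can I help?", None


def refunds(history: List[str]) -> Tuple[str, Optional[str]]:
    return "Refund started.", None


def billing(history: List[str]) -> Tuple[str, Optional[str]]:
    return "Invoice resent.", None


AGENTS: Dict[str, Handler] = {"triage": triage, "refunds": refunds, "billing": billing}


def run_swarm(agents: Dict[str, Handler], start: str, user_message: str,
              max_handoffs: int = 5) -> Tuple[List[str], str]:
    history = [user_message]
    current, path, handoffs = start, [start], 0
    while True:
        reply, next_agent = agents[current](history)
        history.append(f"{current}: {reply}")  # accumulated context travels with control
        if next_agent is None:
            return path, reply  # the active agent owns the final answer
        handoffs += 1
        if handoffs > max_handoffs:
            raise RuntimeError(f"hand-off budget exceeded after {path}")
        current = next_agent
        path.append(next_agent)


path, reply = run_swarm(AGENTS, "triage", "I want a refund for my order")
```

The returned `path` makes the moving control token visible, and the counter is what turns a potential infinite loop into a bounded failure.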

Sequential pipeline

The simplest multi-agent shape: a fixed, ordered chain where each stage's output is the next stage's input — extract → transform → summarize → format. No dynamic routing, no negotiation. Topology: linear DAG, static.

Use when stages always run in the same order, you want the easiest debugging (each stage is independently testable and cacheable), and you can stream between stages. Pipelines are also the easiest pattern to slot deterministic guards into — schema validation, a regex gate, a human approval — without an LLM deciding whether to run them. Avoid when tasks need branching or back-tracking: the moment a stage should skip ahead, loop back, or fan out, a rigid pipeline misbehaves or grows a thicket of conditionals — reach for routing or a supervisor.
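
As a rough sketch, a pipeline is a list of functions applied in order, with a deterministic guard slotted between two stages. The stage functions here are hypothetical; `extract` and `summarize` stand in for model calls, while `require_order_id` is ordinary validation code that always runs.

```python
import json
import re
from typing import Callable, List

Stage = Callable[[str], str]


def extract(text: str) -> str:
    """Stand-in for an extraction agent: pull an order id into a JSON payload."""
    match = re.search(r"order\s+#?(\d+)", text, re.IGNORECASE)
    return json.dumps({"order_id": match.group(1) if match else None})


def require_order_id(payload: str) -> str:
    """Deterministic guard: no model decides whether this check runs."""
    if not json.loads(payload).get("order_id"):
        raise ValueError("guard failed: no order_id extracted")
    return payload


def summarize(payload: str) -> str:
    """Stand-in for a summarizing agent."""
    return f"Customer asked about order {json.loads(payload)['order_id']}."


def run_pipeline(stages: List[Stage], text: str) -> str:
    # Each stage's output is the next stage's input; order is fixed.
    for stage in stages:
        text = stage(text)
    return text


PIPELINE: List[Stage] = [extract, require_order_id, summarize]
result = run_pipeline(PIPELINE, "Where is order #1234?")
```

Each stage can be unit-tested or cached on its own, since it is just a function from string to string.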

Blackboard / shared-state

Borrowed from classic AI, the blackboard replaces direct messaging with a shared data structure all agents read from and write to. Agents fire when the state contains something they can act on and contribute their piece until a solution emerges — no agent addresses another directly. Topology: hub around a shared store; coordination is data-driven. In LangGraph terms, the graph's shared state object is effectively a blackboard.

Use when many specialists contribute partial results to a problem whose solution path is not fixed (diagnostic reasoning, iterative planning, design-space search), or you want loosely coupled, swappable agents that read common state. Avoid when you need clear sequencing or accountability — blackboards make control flow implicit, and concurrent writes create contention and consistency bugs that need explicit reducers or locking (real engineering, not a prompt tweak).
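
A toy blackboard makes the data-driven activation concrete. This is a single-threaded sketch under stated assumptions: the three knowledge sources and the board keys are hypothetical, and a concurrent version would also need the reducers or locking mentioned above.

```python
from typing import Any, Callable, Dict, List, Tuple

Board = Dict[str, Any]
# A knowledge source is (name, can_fire, act); act returns the keys it contributes.
Source = Tuple[str, Callable[[Board], bool], Callable[[Board], Board]]

SOURCES: List[Source] = [
    ("diagnoser",
     lambda b: "error_code" in b and "cpu_high" in b and "diagnosis" not in b,
     lambda b: {"diagnosis": f"{b['error_code']} with cpu_high={b['cpu_high']}"}),
    ("log_reader",
     lambda b: "logs" in b and "error_code" not in b,
     lambda b: {"error_code": b["logs"].split()[0]}),
    ("metrics_reader",
     lambda b: "metrics" in b and "cpu_high" not in b,
     lambda b: {"cpu_high": b["metrics"]["cpu"] > 0.9}),
]


def run_blackboard(board: Board, sources: List[Source],
                   done: Callable[[Board], bool], max_rounds: int = 10) -> List[str]:
    fired: List[str] = []
    for _ in range(max_rounds):
        progressed = False
        for name, can_fire, act in sources:
            if can_fire(board):  # agents react to state, never to each other
                board.update(act(board))
                fired.append(name)
                progressed = True
        if done(board):
            return fired
        if not progressed:
            raise RuntimeError("blackboard quiesced without reaching a solution")
    raise RuntimeError("round budget exceeded")


board: Board = {"logs": "E42 disk timeout", "metrics": {"cpu": 0.97}}
fired = run_blackboard(board, SOURCES, done=lambda b: "diagnosis" in b)
```

The diagnoser is listed first but fires last, because firing order comes from what is on the board rather than from the order agents were declared in. The explicit quiescence check is the stall detection that implicit control flow otherwise hides.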

Routing

A thin, high-value pattern: a classifier — a cheap model or plain rules — inspects the input and dispatches it to exactly one downstream agent or chain, with no further coordination. Anthropic lists routing as a foundational workflow because it lets you specialize without paying for a full orchestrator. Topology: one-to-one selection.

Use when inputs fall into distinct categories that each need a different prompt, tool set, or even a different-sized model (simple queries to a small fast model, hard ones to a frontier model), keeping each specialist prompt clean. Avoid when categories overlap heavily or one input needs several specialists at once — then you want a supervisor or a swarm.
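
Routing with plain rules fits in a few lines. The keyword lists, route names, and handlers below are illustrative assumptions; in practice the classifier might be a small model and each handler a separate agent or chain.

```python
from typing import Callable, Dict, Tuple


def classify(query: str) -> str:
    """Rule-based classifier; a cheap model could replace this function."""
    q = query.lower()
    if any(word in q for word in ("refund", "charge", "invoice")):
        return "billing"
    if any(word in q for word in ("error", "crash", "bug")):
        return "technical"
    return "general"


# Each route would normally wrap its own prompt, tool set, and model size.
ROUTES: Dict[str, Callable[[str], str]] = {
    "billing": lambda q: f"billing specialist handles: {q}",
    "technical": lambda q: f"technical specialist handles: {q}",
    "general": lambda q: f"small general model handles: {q}",
}


def route(query: str) -> Tuple[str, str]:
    label = classify(query)
    # Exactly one downstream handler runs; there is no further coordination.
    return label, ROUTES[label](query)


label, response = route("I was charged twice this month")
```

If `classify` starts needing to return more than one label for a single input, that is the signal the text describes: the problem has outgrown routing.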

Pattern comparison

| Pattern | Topology | Control flow | Use when | Main risk |
|---|---|---|---|---|
| Supervisor / hierarchical | Star / tree | Centralized in orchestrator | Decomposable tasks, parallel fan-out + single synthesis, central policy | Supervisor context bloat; deep nesting kills observability |
| Swarm / peer hand-off | Decentralized graph | Moving token via hand-offs | Right specialist decided in-flow; tight role-scoped agents | Infinite hand-off loops; ambiguous ownership of final answer |
| Sequential pipeline | Linear DAG (static) | Fixed stage order | Stable known stages, max predictability, streamable throughput | No branching; collapses to over-engineering if stages are trivial |
| Blackboard / shared-state | Hub around shared store | Data-driven activation | Many specialists, unfixed solution path, loose coupling | Implicit control flow; write contention and consistency bugs |
| Routing | One-to-one selection | Single dispatch decision | Distinct input categories, model/tool specialization | Breaks when one input needs multiple specialists |

These patterns compose: a production system is often routing at the front, a supervisor behind one route, and a pipeline inside a worker. Name which pattern governs each boundary rather than letting it emerge by accident.

Coordination and State: Shared Memory, Message Passing, and A2A

Once you have more than one agent, the hard part is no longer the prompts — it is how state and messages move. Three mechanisms cover it.

  • Shared memory / shared state. Agents read and write a common store (the blackboard, a LangGraph state object, a vector or key-value store). Low coordination overhead, but you own consistency: define reducers, decide last-write-wins vs. merge, guard concurrent writes. Persistence here also gives the run durability and resumability.
  • Message passing. Agents exchange explicit messages — request/response or hand-off payloads — which is how supervisor dispatch and swarm hand-offs work under the hood. Easier to trace than shared state but couples agents to each other's contracts.
  • A2A and agent gateways. When agents live in different processes, teams, or trust domains, you need a wire protocol and a control point. The Google Agent2Agent (A2A) protocol — now under the Linux Foundation's agent-interoperability effort — standardizes how agents discover one another (via an Agent Card), authenticate, and exchange tasks over HTTP, independent of framework. Pair it with the Model Context Protocol (MCP) for the tool layer: A2A is agent-to-agent, MCP is agent-to-tool, and most real systems use both. A gateway terminates this traffic so identity, authorization, rate limiting, and audit live in one place instead of N point-to-point trust relationships. For the tool side, see MCP gateway auth and discovery; for the registry side, the agent registry.

Keep coordination as local as you can: in-process agents sharing a typed state object need a clean reducer, not A2A. Reach for A2A and a gateway only when the boundary is organizational or cross-runtime — ASCENDING's Jarvis Registry agent gateway is one implementation of exactly this control point.
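
For the in-process case, "a clean reducer" might look like the sketch below. It is loosely modeled on the per-key reducer idea, not on LangGraph's actual API; `SharedState`, `append`, and `last_write_wins` are hypothetical names.

```python
import threading
from typing import Any, Callable, Dict

Reducer = Callable[[Any, Any], Any]


def append(old: Any, new: Any) -> Any:
    """Merge policy: concurrent contributions accumulate instead of overwriting."""
    return (old or []) + list(new)


def last_write_wins(old: Any, new: Any) -> Any:
    """Overwrite policy: fine for status flags, lossy for shared lists."""
    return new


class SharedState:
    def __init__(self, reducers: Dict[str, Reducer]) -> None:
        self._reducers = reducers
        self._data: Dict[str, Any] = {}
        self._lock = threading.Lock()  # guards concurrent writes

    def write(self, update: Dict[str, Any]) -> None:
        with self._lock:
            for key, value in update.items():
                reducer = self._reducers.get(key, last_write_wins)
                self._data[key] = reducer(self._data.get(key), value)

    def snapshot(self) -> Dict[str, Any]:
        with self._lock:
            return dict(self._data)


state = SharedState({"findings": append})
threads = [
    threading.Thread(target=state.write,
                     args=({"findings": [f"worker-{i}"], "status": "running"},))
    for i in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
final = state.snapshot()
```

All eight findings survive because the `findings` key merges, while `status` simply keeps whichever write landed last. Choosing that policy per key, up front, is the consistency work the shared-state approach asks of you.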

The Context and Token Cost of Agent-to-Agent Chatter

Every agent boundary has a price, mostly paid in tokens. When agent A calls agent B, B receives a fresh system prompt, tool definitions, and forwarded context; its answer flows back into A's context for synthesis. Multiply that across fan-out and hand-offs and the cost is non-linear: Anthropic reported its multi-agent research system consumed roughly fifteen times the tokens of a single chat — worth internalizing before you fan out to a dozen subagents.

The cost drivers are predictable: context duplication (shared background re-sent to each agent), synthesis overhead (a supervisor holding every worker's output in its window), hand-off accumulation (swarm conversations grow with each pass), and chatter loops. Mitigations are equally concrete: pass summaries or structured artifacts across boundaries instead of raw transcripts; cap fan-out width and hand-off depth; give workers tight, role-scoped tool sets; and prompt-cache the stable prefix each agent re-sends. Budget the run as a whole, not per agent.
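
A back-of-the-envelope model shows how these drivers add up. Every number below is a made-up illustration, not a measurement, and the model is deliberately crude: it ignores the synthesis output, multi-turn tool loops, and caching discounts.

```python
def supervisor_tokens(workers: int, system_prompt: int, tool_defs: int,
                      shared_context: int, worker_output: int,
                      returned_per_worker: int) -> int:
    """Tokens for one fan-out: every worker re-reads the shared prefix,
    then the supervisor reads whatever each worker sends back."""
    per_worker = system_prompt + tool_defs + shared_context + worker_output
    synthesis_input = system_prompt + workers * returned_per_worker
    return workers * per_worker + synthesis_input


def handoff_chain_tokens(passes: int, base_context: int, added_per_pass: int) -> int:
    """Input tokens for a swarm conversation that grows on every hand-off."""
    total, context = 0, base_context
    for _ in range(passes):
        total += context           # the receiving agent re-reads everything so far
        context += added_per_pass  # and then adds its own turn
    return total


# Hypothetical sizes: 1k system prompt, 2k tool definitions, 5k shared background,
# 3k of output per worker, six workers.
raw = supervisor_tokens(6, 1000, 2000, 5000, 3000, returned_per_worker=3000)        # 85000
summarized = supervisor_tokens(6, 1000, 2000, 5000, 3000, returned_per_worker=500)  # 70000

five_passes = handoff_chain_tokens(5, 2000, 1000)   # 20000
ten_passes = handoff_chain_tokens(10, 2000, 1000)   # 65000
```

In this toy model, returning 500-token summaries instead of 3,000-token transcripts saves 15,000 tokens on the synthesis step alone, and doubling hand-off depth from five to ten more than triples the input tokens because each pass re-reads a longer conversation.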

Failure Modes and Mitigations

Multi-agent systems fail in ways single agents do not — design for these from the start.

  1. Cascading errors. One agent's wrong or hallucinated output becomes another's trusted input and the mistake compounds. Mitigate: validate at boundaries (schema checks, assertions, a critic agent); never treat an upstream claim as ground truth without verification where it matters.
  2. Infinite hand-off / delegation loops. A delegates to B, B back to A, forever. Mitigate: a global hand-off counter and max-step budget per run; detect repeated A→B→A cycles and break to a fallback or human.
  3. Deadlocks and stalls. Agents wait on each other or on shared state that never arrives; blackboards can quiesce with no agent able to act. Mitigate: timeouts on every inter-agent call, a watchdog that fails the run on no state progress, and a defined terminal condition.
  4. Cost blow-ups. Recursive spawning or chatter loops run up an unbounded bill. Mitigate: a hard token/dollar budget at the orchestrator (and ideally the gateway), aborting on breach.
  5. Lost-in-the-middle / context poisoning. Over-long forwarded context buries the instruction that matters, or one agent's bad framing biases the rest. Mitigate: forward summaries, keep contexts disjoint, and isolate adversarial roles.
  6. Partial-failure ambiguity. One worker fails; retry, route around it, or fail the task? Mitigate: decide failure semantics per worker up front — required vs. best-effort — and make degradation explicit, not emergent.

The unifying principle: bound everything — steps, hand-offs, depth, time, and money all need explicit ceilings. An unbounded multi-agent system is not autonomous; it is a runaway.
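
One way to make those ceilings explicit is a single budget object that every agent step must charge against. This is a sketch under stated assumptions: `RunBudget`, its field names, and the ping-pong heuristic are hypothetical, and a deliberate generator/critic loop would need a looser cycle rule than the one shown.

```python
import time
from dataclasses import dataclass, field
from typing import List


class BudgetExceeded(RuntimeError):
    """Raised when any ceiling is breached; the caller aborts or escalates."""


@dataclass
class RunBudget:
    max_steps: int
    max_handoffs: int
    max_tokens: int
    max_seconds: float
    steps: int = 0
    handoffs: int = 0
    tokens: int = 0
    started: float = field(default_factory=time.monotonic)
    trail: List[str] = field(default_factory=list)

    def charge(self, agent: str, tokens: int) -> None:
        """Record one agent step and fail fast on the first breached ceiling."""
        self.steps += 1
        self.tokens += tokens
        if self.trail and self.trail[-1] != agent:
            self.handoffs += 1  # control moved to a different agent
        self.trail.append(agent)
        if self.steps > self.max_steps:
            raise BudgetExceeded("step ceiling reached")
        if self.handoffs > self.max_handoffs:
            raise BudgetExceeded("hand-off ceiling reached")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded("token ceiling reached")
        if time.monotonic() - self.started > self.max_seconds:
            raise BudgetExceeded("time ceiling reached")
        if self._ping_pong():
            raise BudgetExceeded("loop detected: " + " -> ".join(self.trail[-4:]))

    def _ping_pong(self) -> bool:
        # Heuristic: the last four steps alternate between the same two agents.
        t = self.trail[-4:]
        return len(t) == 4 and t[0] == t[2] and t[1] == t[3] and t[0] != t[1]


budget = RunBudget(max_steps=20, max_handoffs=10, max_tokens=50_000, max_seconds=60.0)
budget.charge("planner", 1200)
budget.charge("executor", 800)
```

Calling `budget.charge(...)` at the orchestrator before each dispatch keeps all five ceilings in one place, so a breach surfaces as one exception type rather than as a surprise on the invoice.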

Evaluation and Observability of Multi-Agent Runs

A multi-agent run is far harder to see than a single completion. The non-negotiable foundation is end-to-end tracing: every agent invocation, tool call, hand-off, and state mutation captured as a span in one correlated trace, so you can reconstruct who did what, in what order, and why control moved. OpenTelemetry's GenAI semantic conventions and the trajectory-inspection features in platform observability stacks exist for exactly this — our agent observability glossary entry goes deeper.
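
To show what "one correlated trace" means structurally, here is a standard-library toy tracer. It is not the OpenTelemetry API, and the span names and attributes are invented for illustration; a real system would emit spans through an OpenTelemetry SDK instead.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Any, Dict, Iterator, List, Optional


@dataclass
class Span:
    trace_id: str              # shared by every span in the run
    span_id: str
    parent_id: Optional[str]   # links a tool call or hand-off to the agent that caused it
    name: str
    attributes: Dict[str, Any]
    start: float
    end: float = 0.0


class Tracer:
    """Single-threaded sketch: nesting is tracked with a simple stack."""

    def __init__(self) -> None:
        self.trace_id = uuid.uuid4().hex
        self.spans: List[Span] = []
        self._stack: List[Span] = []

    @contextmanager
    def span(self, name: str, **attributes: Any) -> Iterator[Span]:
        parent = self._stack[-1].span_id if self._stack else None
        current = Span(self.trace_id, uuid.uuid4().hex[:16], parent,
                       name, attributes, time.monotonic())
        self.spans.append(current)
        self._stack.append(current)
        try:
            yield current
        finally:
            current.end = time.monotonic()
            self._stack.pop()


tracer = Tracer()
with tracer.span("agent.supervisor", task="quarterly review"):
    with tracer.span("agent.worker", role="contracts"):
        with tracer.span("tool.search", query="renewal terms"):
            pass
    with tracer.span("handoff", to="writer", reason="synthesis ready"):
        pass
```

Walking `tracer.spans` by `parent_id` reconstructs who did what and in what order, and the `reason` attribute on the hand-off span is where "why control moved" gets recorded.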

Beyond tracing, evaluate at two levels:

  • Trajectory (process) evaluation. Did the system take a sane path: correct routing, no needless hand-offs, no loops, bounded cost? Process metrics catch regressions that output-only scoring misses, like a run that returns the right answer but now burns 3x the tokens.
  • Outcome (task) evaluation. Did the final result meet the bar? Because outputs are open-ended, lean on LLM-as-judge rubrics, golden-set comparisons, and sampled human review — and, as Anthropic notes, evaluate the end state rather than every intermediate step, since agents legitimately reach good answers by different routes.
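
A trajectory check can be computed directly from recorded steps. The sketch below assumes a run is logged as `(agent, tokens)` pairs and compared against a known-good baseline; the thresholds, names, and sample data are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple

Step = Tuple[str, int]  # (agent name, tokens consumed in that step)


@dataclass
class TrajectoryReport:
    handoffs: int
    total_tokens: int
    repeated_edges: List[Tuple[str, str]]  # hand-offs taken more than once
    token_ratio: float                     # this run's tokens / baseline tokens

    def passed(self, max_handoffs: int = 3, max_token_ratio: float = 1.5) -> bool:
        return (self.handoffs <= max_handoffs
                and not self.repeated_edges
                and self.token_ratio <= max_token_ratio)


def evaluate_trajectory(steps: List[Step], baseline_tokens: int) -> TrajectoryReport:
    agents = [agent for agent, _ in steps]
    # A hand-off edge is any consecutive pair where control changed agents.
    edges = [(a, b) for a, b in zip(agents, agents[1:]) if a != b]
    seen, repeated = set(), []
    for edge in edges:
        if edge in seen and edge not in repeated:
            repeated.append(edge)
        seen.add(edge)
    total = sum(tokens for _, tokens in steps)
    return TrajectoryReport(len(edges), total, repeated, total / baseline_tokens)


good = evaluate_trajectory([("triage", 400), ("billing", 1100)], baseline_tokens=1500)
bloated = evaluate_trajectory(
    [("triage", 400), ("billing", 1100), ("triage", 900), ("billing", 2100)],
    baseline_tokens=1500,
)
```

Both runs could return the same final answer, yet only the second one fails: it repeats the triage-to-billing hand-off and spends three times the baseline tokens, which is exactly the kind of regression outcome scoring alone would miss.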

Track per-run token and dollar cost, hand-off and step counts, per-agent latency and error rates, and loop/deadlock incidence as first-class SLOs. If those are not on a dashboard, the first you hear of a regression is the invoice.

Build vs. Framework: A Note Without Hype

Here is each option, described by what it is actually good at:

  • LangGraph (LangChain) — a graph/state-machine library with explicit nodes, edges, and shared state, plus durable execution and checkpointing. Best for precise, debuggable control over supervisor or blackboard topologies.
  • CrewAI — a role-and-task framework: declare agents with roles and goals, then assemble them into crews that run as a sequential or hierarchical process. Faster to stand up than a raw graph.
  • AutoGen (Microsoft) — conversation-centric, strong for chat-style collaboration and human-in-the-loop. Microsoft is consolidating it with Semantic Kernel into a unified Agent Framework, so check current status first.
  • OpenAI Agents SDK — a lightweight, production-minded successor to the Swarm experiment; hand-offs and guardrails are first-class, a natural fit for the swarm pattern.
  • Google A2A — not a framework but the interoperability protocol; reach for it (and a gateway) when agents talk across frameworks, processes, or organizations.

The honest decision rule: build it yourself when the logic is simple — routing and short pipelines are often a few functions and a switch statement, and a framework's abstractions cost more than they save. Adopt a framework when you need its hard parts — durable state, checkpointing, replay, structured hand-offs, human-in-the-loop. Treat the protocol layer (A2A, MCP) as orthogonal to the framework choice.

Frequently Asked Questions

When should I use a multi-agent system instead of a single agent with tools?

Use multiple agents only when the task has separable sub-problems that benefit from context isolation, genuine parallelism, or adversarial/specialized roles, or when one agent's tool surface is too large for reliable selection. If none of those hold, a single agent with a good tool set is cheaper, faster, and easier to debug — and most "multi-agent" designs should be exactly that.

What is the difference between supervisor and swarm orchestration?

A supervisor (orchestrator-worker) centralizes control: one agent decomposes the task, dispatches to workers, and synthesizes the result, with workers never talking directly. A swarm is decentralized: peer agents transfer control by handing off the conversation to whichever specialist fits next, with no single coordinator. Supervisors are more predictable and observable; swarms adapt better when the right specialist is only known in-flow.

How do I prevent infinite hand-off loops in a multi-agent system?

Enforce a global maximum step count and a hand-off counter per run, and abort or escalate to a human when either is exceeded. Add cycle detection for repeated A→B→A patterns, put timeouts on every inter-agent call, and bound the total token/dollar budget at the orchestrator so a loop cannot run up an unbounded bill before it is caught.

What is the A2A protocol and how does it relate to MCP?

The Agent2Agent (A2A) protocol, originally from Google and now under the Linux Foundation, is an open standard for how independent agents discover each other (via an Agent Card), authenticate, and exchange tasks over HTTP regardless of framework. MCP (Model Context Protocol) is complementary but at a different layer, standardizing how an agent connects to tools and data. A2A is agent-to-agent; MCP is agent-to-tool; production systems commonly use both.

Why do multi-agent systems cost so many more tokens?

Each agent boundary re-sends a system prompt, tool definitions, and forwarded context, and synthesis agents must hold every worker's output in their own window — so cost grows non-linearly with fan-out and hand-off depth. Anthropic reported its multi-agent research system used roughly fifteen times the tokens of a single chat. Mitigate by forwarding summaries instead of transcripts, scoping tools tightly, and budgeting the whole run rather than per agent.

Which framework should I use for multi-agent orchestration?

Match the tool to the topology: LangGraph for explicit, debuggable graph/state control with durable execution; CrewAI for fast role-based crews; AutoGen for conversation-driven, human-in-the-loop collaboration; and the OpenAI Agents SDK for swarm-style hand-offs. Build it yourself when the logic is simple routing or a short pipeline, and adopt a framework only when you need its hard parts — durable state, checkpointing, replay, structured hand-offs.

Citations and References

  1. Anthropic — "Building effective agents": https://www.anthropic.com/engineering/building-effective-agents
  2. Anthropic — "How we built our multi-agent research system": https://www.anthropic.com/engineering/built-multi-agent-research-system
  3. Google / Linux Foundation — Agent2Agent (A2A) protocol: https://a2a-protocol.org
  4. Model Context Protocol — specification and docs: https://modelcontextprotocol.io
  5. LangGraph — documentation and repository: https://github.com/langchain-ai/langgraph
  6. CrewAI — documentation and repository: https://github.com/crewAIInc/crewAI
  7. Microsoft AutoGen — documentation and repository: https://github.com/microsoft/autogen
  8. OpenAI Agents SDK (successor to Swarm) — documentation and repository: https://github.com/openai/openai-agents-python
  9. OpenTelemetry — GenAI semantic conventions: https://opentelemetry.io/docs/specs/semconv/gen-ai/