TL;DR: Pattern selection drives everything: supervisor-worker for complex decomposable tasks, pipeline for sequential handoffs, swarm/fan-out for embarrassingly parallel work. Google's A2A protocol standardizes cross-agent discovery and delegation via AgentCard manifests and task delegation over HTTP+SSE. MCP handles tools, A2A handles agents — they compose cleanly at the boundary between capabilities and agents. Swarm coordination requires explicit consensus or aggregation layers; without them you get racing writes and silent conflicts. Error handling in multi-agent systems is fundamentally different from single-agent error handling — failures cascade, retries can double-execute, and partial completion is the common case. Hierarchical systems add coordination overhead that only pays off above 4-5 agents — below that, flat coordination wins.
Multi-agent orchestration is where most AI projects stall. Individual agents are tractable — you prompt an LLM, wrap it in a tool loop, and ship something useful. But the moment you need multiple agents to coordinate, share state, delegate sub-tasks, and handle each other's failures, you're operating in genuinely hard distributed systems territory. This article is a technical pattern catalog for engineers building multi-agent systems: when to use swarm vs. supervisor-worker vs. pipeline architectures, how Google's A2A protocol enables cross-agent discovery, how MCP and A2A compose together, and what production failure modes look like with concrete mitigations.
Note: This article focuses on the engineering architecture of multi-agent systems. For the product design angle — how to design B2B products consumed by agents, semantic APIs, and the business case for agent-first architecture — see Multi-Agent Orchestration: How to Design Products Where AI Agents Are the Primary Users.
Table of Contents
- The Orchestration vs. Choreography Divide
- Pattern Selection: A Decision Framework
- Google's A2A Protocol: Agent Discovery at Scale
- MCP + A2A: Tools and Agents Together
- Swarm Architecture: Fan-Out, Map-Reduce, and Voting
- Supervisor-Worker Pattern
- Pipeline Pattern: Sequential Handoffs
- Hierarchical Multi-Agent Systems
- Error Handling in Multi-Agent Systems
- Production Examples
- FAQ
The Orchestration vs. Choreography Divide
Before picking a pattern, you need to decide whether your system is orchestrated or choreographed. The distinction matters because it determines where coordination logic lives — and therefore where bugs are easiest to find and fix.
Orchestration means a central coordinator (the orchestrator) knows the full workflow and directs each agent explicitly. The orchestrator tells Agent A to run, waits for results, then tells Agent B what to do next. All workflow logic is in one place. This makes the system easy to reason about, debug, and change. The tradeoff: the orchestrator is a single point of failure and a potential bottleneck.
Choreography means each agent knows only its own role and reacts to events. Agent A completes its work, emits an event, and Agent B — listening to that event queue — picks up automatically. There's no central coordinator. This is more resilient and scalable, but significantly harder to debug because workflow logic is distributed across every agent's event subscriptions.
In practice, most production multi-agent systems are hybrid: orchestrated at the macro level (a supervisor knows the high-level plan) and choreographic at the micro level (agents react to shared state changes within their sub-domain). The patterns below span this spectrum.
The key practical rule: start with orchestration, graduate to choreography only when you hit scale constraints. Choreographic systems require sophisticated distributed tracing to debug — invest in that infrastructure first or you'll spend weeks on incidents you can't reproduce.
Pattern Selection: A Decision Framework
Choosing the wrong pattern is the most common multi-agent architecture mistake. Here's a decision framework based on four variables: task decomposability, result dependency, failure tolerance, and coordination overhead budget.
flowchart TD
B{Can task be decomposed\ninto independent subtasks?}
B -->|No - strictly sequential| C{Does each step depend\non prior step's output?}
B -->|Yes - parallelizable| D{Do results need\nconsensus/aggregation?}
C -->|Yes| E[Pipeline Pattern\nSequential handoffs\nLow overhead, easy debug]
C -->|No| F[Single agent\nwith tools]
D -->|Yes - voting/merge needed| G{How many agents?}
D -->|No - independent outputs| H[Fan-out / Fire-and-forget\nParallel Swarm]
G -->|2-4 agents| I[Swarm with\naggregator node]
G -->|5+ agents| J[Map-Reduce Swarm\nDistributed aggregation]
A --> K{Is the task a complex\nhigh-level goal?}
K -->|Yes - needs decomposition| L{Predictable decomposition\nor dynamic planning?}
K -->|No| B
L -->|Predictable - known steps| M[Pipeline Pattern]
L -->|Dynamic - LLM plans steps| N{Single coordinator\nor multi-level?}
N -->|Under 5 worker agents| O[Supervisor-Worker\nFlat hierarchy]
N -->|5+ workers or nested goals| P[Hierarchical Multi-Agent\nSupervisor tree]
style E fill:#2563eb,color:#fff
style H fill:#16a34a,color:#fff
style I fill:#16a34a,color:#fff
style J fill:#16a34a,color:#fff
style M fill:#2563eb,color:#fff
style O fill:#9333ea,color:#fff
style P fill:#9333ea,color:#fff
The decision tree above maps to four pattern families:
- Pipeline: sequential stages with typed handoffs
- Swarm: fan-out, map-reduce, and voting over parallel agents
- Supervisor-worker: dynamic LLM planning with a flat worker pool
- Hierarchical: supervisor trees for large or nested goals
We cover each in depth below.
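The main branches of the decision framework above can be encoded as a small selection helper. This is a simplified sketch: the field names are illustrative, and it collapses the "predictable decomposition" branch into the pipeline case.

```python
from dataclasses import dataclass

@dataclass
class TaskProfile:
    decomposable: bool           # can it split into independent subtasks?
    sequential_dependency: bool  # does each step need the prior step's output?
    needs_consensus: bool        # must parallel outputs be voted/merged?
    dynamic_planning: bool       # does an LLM need to plan the steps?
    worker_count: int            # expected number of agents involved

def select_pattern(p: TaskProfile) -> str:
    """Encode the decision tree above: return a pattern-family name."""
    if p.dynamic_planning:
        # Complex goal needing LLM decomposition
        return "hierarchical" if p.worker_count >= 5 else "supervisor-worker"
    if not p.decomposable:
        return "pipeline" if p.sequential_dependency else "single-agent"
    if not p.needs_consensus:
        return "fan-out"
    return "map-reduce-swarm" if p.worker_count >= 5 else "swarm-with-aggregator"
```

Encoding the choice this way also gives you a place to log *why* a pattern was chosen, which pays off when debugging misrouted workflows.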
Google's A2A Protocol: Agent Discovery at Scale
The Google Agent-to-Agent (A2A) protocol is an open specification published in April 2025 that standardizes how agents discover each other, advertise capabilities, and delegate tasks. Before A2A, agent-to-agent communication was proprietary per framework — an Anthropic agent couldn't cleanly call a LangGraph agent without custom glue code. A2A fixes this at the protocol layer.
The AgentCard
The central primitive in A2A is the AgentCard — a JSON manifest that an agent exposes at a well-known URL (/.well-known/agent.json) describing what it can do, how to call it, and what authentication it requires. Think of it as an OpenAPI spec for agents.
{
  "name": "document-analysis-agent",
  "description": "Analyzes documents for key entities, sentiment, and action items",
  "version": "1.2.0",
  "url": "https://agents.example.com/document-analysis",
  "capabilities": {
    "streaming": true,
    "pushNotifications": true,
    "stateTransitionHistory": true
  },
  "authentication": {
    "schemes": ["Bearer"]
  },
  "skills": [
    {
      "id": "analyze-document",
      "name": "Analyze Document",
      "description": "Extract entities, sentiment, and action items from a document",
      "inputModes": ["text", "file"],
      "outputModes": ["text", "structured"],
      "tags": ["nlp", "document-processing", "extraction"]
    },
    {
      "id": "compare-documents",
      "name": "Compare Documents",
      "description": "Diff two documents and summarize changes",
      "inputModes": ["file"],
      "outputModes": ["text", "structured"],
      "tags": ["document-processing", "comparison"]
    }
  ],
  "defaultInputMode": "text",
  "defaultOutputMode": "text"
}
An orchestrating agent discovers available agents by fetching their AgentCards — either from a registry or from known URLs. It then selects the right agent based on skill tags and capability match, without any custom routing code.
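A sketch of what capability-based selection over fetched AgentCards can look like. The tag-overlap scoring here is one reasonable heuristic, not part of the A2A spec; fetching the cards themselves is assumed to have happened already.

```python
from typing import List, Optional

def select_agent(cards: List[dict], required_tags: set) -> Optional[dict]:
    """
    Pick the AgentCard whose declared skills best cover the required tags.
    Returns the best-matching card, or None if nothing overlaps at all.
    """
    best, best_score = None, 0
    for card in cards:
        # Union of tags across all of this agent's declared skills
        tags = {t for skill in card.get("skills", []) for t in skill.get("tags", [])}
        score = len(tags & required_tags)
        if score > best_score:
            best, best_score = card, score
    return best
```

Against the example card above, `select_agent(cards, {"nlp", "extraction"})` would match the document-analysis agent via its `analyze-document` skill tags.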
Task Delegation and the Three-Agent Framework
A2A defines a task lifecycle with explicit states: submitted → working → input-required → completed | failed | canceled. Tasks are submitted via POST /tasks/send and tracked via GET /tasks/{id}. Long-running tasks stream updates via Server-Sent Events.
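The lifecycle can be enforced with a simple transition table — a sketch based on the states listed above (the exact set of legal edges is my reading; the spec is authoritative):

```python
# Allowed task-state transitions for the A2A lifecycle described above
A2A_TRANSITIONS = {
    "submitted":      {"working", "canceled"},
    "working":        {"input-required", "completed", "failed", "canceled"},
    "input-required": {"working", "canceled"},
    # Terminal states: no outgoing transitions
    "completed": set(),
    "failed": set(),
    "canceled": set(),
}

def transition(current: str, target: str) -> str:
    """Validate a task-state transition; reject illegal ones loudly."""
    if target not in A2A_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```

Rejecting illegal transitions at the boundary catches a whole class of bugs (e.g. a remote agent reporting `working` after `completed`) before they corrupt monitoring state.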
The three-agent framework Google illustrates in the A2A spec shows the pattern in practice: a user-facing agent (the client agent), a coordinating agent (the host agent), and specialist execution agents (the remote agents). The client agent receives user intent and delegates to the host, which selects and orchestrates remote agents based on AgentCards.
sequenceDiagram
participant U as User / Client App
participant CA as Client Agent
participant HA as Host Agent
participant RA1 as Remote Agent 1\n(Document Analysis)
participant RA2 as Remote Agent 2\n(Data Enrichment)
participant R as Agent Registry
U->>CA: "Analyze these contracts and enrich company data"
CA->>HA: POST /tasks/send {skill: "multi-step-analysis", input: contracts}
HA->>R: GET /.well-known/agent.json (fetch AgentCards)
R-->>HA: AgentCard[] — capabilities manifest
HA->>HA: Select agents by skill tags
HA->>RA1: POST /tasks/send {skill: "analyze-document", input: contracts}
HA->>RA2: POST /tasks/send {skill: "enrich-company", input: company_names}
RA1-->>HA: SSE stream: working → completed {entities, action_items}
RA2-->>HA: SSE stream: working → completed {firmographic_data}
HA->>HA: Synthesize results
HA-->>CA: POST /tasks/{id}/updates {state: "completed", result: {...}}
CA-->>U: Final synthesized response
The key engineering advantages of A2A over ad-hoc agent calling:
- Capability-based routing — the host agent selects agents based on their declared skills, not hardcoded logic
- Standardized task state — all agents use the same task lifecycle, so monitoring and debugging work uniformly
- Streaming by default — SSE transport means long-running tasks don't block and progress is visible
- Cross-framework interop — an agent built on LangGraph can call an agent built on CrewAI without custom adapters, as long as both expose A2A-compliant endpoints
The A2A spec is still evolving. As of early 2026, the main gaps are around agent-to-agent trust negotiation (currently relies on standard OAuth but doesn't specify agent identity attestation) and distributed task cancellation (canceling a host task doesn't automatically cascade to remote agents). Both are on the roadmap.
MCP + A2A: Tools and Agents Together
A common question when working with both protocols: when does an agent use MCP, and when does it use A2A?
The answer is conceptually clean: MCP is for calling tools (functions), A2A is for calling agents (autonomous processes). They operate at different levels of abstraction.
An MCP tool is a synchronous, stateless function call. It takes typed inputs, returns typed outputs, and has no memory or ongoing state. Examples: search the web, query a database, send an email, run a code snippet.
An A2A agent is an autonomous process that can reason, plan, call tools itself, maintain state across multiple turns, and produce outputs asynchronously. Examples: research an entire topic and produce a report, plan and execute a multi-step data pipeline, manage an ongoing customer engagement.
In a well-designed multi-agent system, both are present and they compose cleanly:
Orchestrator Agent
├── MCP Server (tools)
│ ├── web_search()
│ ├── query_database()
│ └── send_email()
└── A2A Remote Agents
├── Research Agent (calls web_search MCP internally)
├── Analysis Agent (calls query_database MCP internally)
└── Outreach Agent (calls send_email MCP internally)
The orchestrator calls its MCP tools for simple, direct actions. It calls A2A agents for complex sub-goals that require multi-step reasoning. Each remote agent, in turn, may have its own MCP tools — the protocols are composable at every level.
The MCP specification and MCP integration for SaaS products cover the tooling side. The integration point to be explicit about: when an agent exposes itself via A2A, it's common to also expose a subset of its capabilities as MCP tools — for callers that want simple function-call access rather than full agentic delegation. This backward-compatible layering lets you adopt A2A incrementally without breaking existing MCP-based integrations.
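One way to make the routing rule concrete is a capability registry that maps each action to either an MCP tool or an A2A endpoint. Everything below is hypothetical (names, URLs, and the registry shape are illustrative):

```python
# Hypothetical capability registry: simple actions map to MCP tools,
# complex sub-goals map to A2A agent endpoints.
CAPABILITIES: dict = {
    "web_search": ("mcp", "tool:web_search"),
    "send_email": ("mcp", "tool:send_email"),
    "research":   ("a2a", "https://agents.example.com/research"),
    "outreach":   ("a2a", "https://agents.example.com/outreach"),
}

def route(capability: str) -> tuple:
    """
    Return ("mcp", tool_ref) for a direct function call,
    ("a2a", agent_url) for delegation to an autonomous agent.
    """
    try:
        return CAPABILITIES[capability]
    except KeyError:
        raise LookupError(f"no MCP tool or A2A agent registered for {capability!r}")
```

The useful property is that the orchestrator's planning prompt can be generated from this one table, so tools and agents stay in sync with what the planner believes exists.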
Swarm Architecture: Fan-Out, Map-Reduce, and Voting
Swarm architectures run multiple agents in parallel on the same or related inputs. They're the right choice when:
- Work is embarrassingly parallel (each agent works on an independent slice)
- You need diverse perspectives and want to aggregate or vote on outputs
- Single-agent throughput is insufficient for latency requirements
There are three swarm sub-patterns, each with distinct coordination requirements.
Fan-Out (Parallel Execution)
The simplest swarm pattern: spawn N agents on the same input, collect all outputs independently. No aggregation needed — each agent produces an independent artifact.
import asyncio
from typing import List

async def fan_out_agents(
    task: str,
    agent_configs: List[dict],
    timeout_seconds: int = 60
) -> List[dict]:
    """
    Run N agents in parallel on the same task.
    Returns all results, including partial results on timeout.
    """
    async def run_agent(config: dict) -> dict:
        try:
            result = await execute_agent(
                agent_id=config["id"],
                task=task,
                tools=config["tools"],
                timeout=timeout_seconds
            )
            return {"agent_id": config["id"], "status": "success", "result": result}
        except TimeoutError:
            return {"agent_id": config["id"], "status": "timeout", "result": None}
        except Exception as e:
            return {"agent_id": config["id"], "status": "error", "error": str(e)}

    # run_agent catches its own exceptions, so gather never aborts early
    # and one failed agent can't take down the others
    results = await asyncio.gather(
        *[run_agent(config) for config in agent_configs],
        return_exceptions=False
    )
    return results
Fan-out is straightforward but has a subtle failure mode: stragglers. One slow agent can hold up the entire workflow if you wait for all results. Use a timeout with partial result acceptance — if 7 of 8 agents complete within your SLO, return those 7 results and log the straggler.
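A sketch of the timeout-with-partial-results approach using asyncio.wait: whatever finished by the deadline is returned, and stragglers are cancelled and counted rather than awaited (the agent coroutines themselves are assumed).

```python
import asyncio
from typing import List

async def fan_out_with_deadline(coros: List, deadline_seconds: float) -> dict:
    """
    Run agent coroutines concurrently; return whatever finished by the
    deadline and cancel the stragglers instead of waiting for them.
    """
    tasks = [asyncio.ensure_future(c) for c in coros]
    done, pending = await asyncio.wait(tasks, timeout=deadline_seconds)
    for t in pending:
        t.cancel()  # cancel and count stragglers; log them in production
    return {
        "results": [t.result() for t in done if not t.exception()],
        "straggler_count": len(pending),
    }
```

Pair this with an SLO-based deadline: if 7 of 8 agents finish in time, the caller gets 7 results immediately instead of waiting on the slowest one.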
Map-Reduce Swarm
The map-reduce pattern adds an aggregation step. Map phase: fan out agents over a large input (e.g., each agent processes one document in a corpus of 1,000). Reduce phase: a separate aggregator agent synthesizes all individual outputs into a final result.
async def map_reduce_workflow(
    documents: List[str],
    analysis_agent_config: dict,
    aggregator_agent_config: dict,
    batch_size: int = 10
) -> dict:
    """
    Map: analyze each document in parallel batches.
    Reduce: aggregate all analyses into a final report.
    """
    # MAP phase - process in batches to control concurrency
    all_analyses = []
    for i in range(0, len(documents), batch_size):
        batch = documents[i:i + batch_size]
        batch_tasks = [
            run_analysis_agent(analysis_agent_config, doc)
            for doc in batch
        ]
        batch_results = await asyncio.gather(*batch_tasks)
        all_analyses.extend(batch_results)
        # Checkpoint after each batch for resumability
        await save_checkpoint(f"batch_{i}", batch_results)

    # REDUCE phase - aggregate all analyses
    reduction_input = {
        "analyses": all_analyses,
        "total_documents": len(documents),
        "task": "Synthesize all individual analyses into a final comprehensive report"
    }
    final_report = await execute_agent(
        agent_id=aggregator_agent_config["id"],
        task=reduction_input,
        tools=aggregator_agent_config["tools"]
    )
    return final_report
The critical design consideration for map-reduce: the aggregator agent's input grows linearly with the number of map agents. With 100 documents, each producing a 500-token analysis, your aggregator input is 50,000 tokens before it even starts its own reasoning. Design the map agents to produce compressed summaries rather than verbose reports, and use hierarchical reduction (reduce groups of 10, then reduce the group summaries) for very large corpora.
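The hierarchical-reduction idea generalizes to a small helper. Here `reduce_fn` stands in for the aggregator-agent call; the point is that the aggregator never sees more than `group_size` inputs at once.

```python
from typing import Callable, List

def hierarchical_reduce(items: List, reduce_fn: Callable[[List], object],
                        group_size: int = 10) -> object:
    """
    Reduce in rounds: collapse each group of `group_size` items with
    reduce_fn, then recurse on the group summaries until one remains.
    """
    if not items:
        raise ValueError("nothing to reduce")
    if len(items) == 1:
        return items[0]
    groups = [items[i:i + group_size] for i in range(0, len(items), group_size)]
    summaries = [reduce_fn(g) for g in groups]
    return hierarchical_reduce(summaries, reduce_fn, group_size)
```

With 1,000 documents and `group_size=10`, the aggregator runs 100 + 10 + 1 times, but each invocation's input stays bounded — which is exactly what keeps the reduce step inside a context window.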
Voting and Consensus Swarm
The voting pattern runs multiple agents on the same problem and uses consensus to select or synthesize the best output. This is valuable for high-stakes decisions where a single agent's output carries too much risk.
async def consensus_vote(
    task: str,
    agents: List[dict],
    voting_strategy: str = "majority"  # majority | weighted | judge
) -> dict:
    """
    Run multiple agents and aggregate via voting.
    """
    # Collect all agent outputs
    raw_outputs = await fan_out_agents(task, agents)
    successful = [r for r in raw_outputs if r["status"] == "success"]

    if voting_strategy == "majority":
        # Cluster similar answers and pick the cluster with most votes
        return await cluster_and_vote(successful)
    elif voting_strategy == "weighted":
        # Weight votes by agent confidence scores
        return await weighted_aggregate(
            successful,
            weight_field="confidence"
        )
    elif voting_strategy == "judge":
        # Use a separate judge agent to evaluate all outputs and pick best
        judge_input = {
            "task": task,
            "candidate_outputs": [r["result"] for r in successful],
            "instruction": "Evaluate each candidate output and select the most accurate, complete, and well-reasoned response."
        }
        return await execute_agent(
            agent_id="judge-agent-v1",
            task=judge_input,
            tools=[]
        )
    else:
        raise ValueError(f"unknown voting strategy: {voting_strategy}")
Voting patterns add latency (you're waiting for the slowest agent in the group) and cost (N agents instead of 1). Reserve them for decisions where accuracy justifies that overhead: medical triage classification, financial fraud detection, legal document review. For general-purpose tasks, the cost-accuracy tradeoff rarely favors voting over a single well-prompted agent with good tools.
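For reference, a minimal version of the majority step: this sketch clusters by normalized exact match, whereas a production `cluster_and_vote` would typically cluster by embedding similarity.

```python
from collections import Counter

def majority_vote(outputs: list) -> dict:
    """
    Cluster candidate answers by normalized text; return the most common
    answer plus its support ratio (fraction of agents that agreed).
    """
    normalize = lambda s: " ".join(s.lower().split())
    counts = Counter(normalize(o) for o in outputs)
    winner, votes = counts.most_common(1)[0]
    return {"answer": winner, "support": votes / len(outputs)}
```

The support ratio is worth surfacing: a 3/5 majority on a high-stakes classification is a very different signal from 5/5, and a low ratio is a natural trigger for human escalation.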
Supervisor-Worker Pattern
The supervisor-worker pattern is the workhorse of multi-agent systems for complex, open-ended tasks. A supervisor agent receives a high-level goal, dynamically plans sub-tasks, spawns worker agents to execute them, monitors their progress, and synthesizes results. This is what LangGraph calls hierarchical agent teams and what CrewAI multi-agent docs call process-based crews.
graph TB
U[User Goal:\n'Research competitive landscape\nfor Series B pitch deck'] --> S
subgraph Supervisor["Supervisor Agent"]
S[Plan decomposition:\n1. Identify competitors\n2. Analyze each competitor\n3. Synthesize comparison\n4. Format for deck]
MONITOR[Progress monitor\n+ state tracker]
SYNTH[Result synthesizer]
end
S --> W1 & W2 & W3
subgraph Workers["Worker Agents (parallel where possible)"]
W1[Competitor Discovery Agent\nTools: web_search, crunchbase_api]
W2[Market Data Agent\nTools: pitchbook_api, news_search]
W3[Product Analysis Agent\nTools: web_search, screenshot, analyze_image]
end
W1 -->|competitor list| MONITOR
W2 -->|market sizing data| MONITOR
MONITOR -->|trigger: competitor list ready| W4
W4[Deep-dive Agent x5\nOne per competitor found]
W4 -->|individual analyses| MONITOR
MONITOR --> SYNTH
W3 -->|product screenshots + analysis| SYNTH
SYNTH --> OUT[Structured competitive\nanalysis report]
style S fill:#7c3aed,color:#fff
style MONITOR fill:#7c3aed,color:#fff
style SYNTH fill:#7c3aed,color:#fff
style W1 fill:#2563eb,color:#fff
style W2 fill:#2563eb,color:#fff
style W3 fill:#2563eb,color:#fff
style W4 fill:#0891b2,color:#fff
Implementation Pattern
The supervisor's core loop is a planning-execution-monitoring cycle:
class SupervisorAgent:
    def __init__(self, llm, available_workers: List[WorkerAgent]):
        self.llm = llm
        self.workers = {w.capability: w for w in available_workers}
        self.state = WorkflowState()

    async def run(self, goal: str) -> dict:
        # Phase 1: Plan
        plan = await self.plan(goal)
        self.state.set_plan(plan)

        # Phase 2: Execute with monitoring
        while not self.state.is_complete():
            ready_tasks = self.state.get_ready_tasks()
            if not ready_tasks:
                # Check for stalls
                if self.state.has_stalled():
                    await self.handle_stall()
                else:
                    await asyncio.sleep(0.5)
                continue
            # Dispatch ready tasks to workers
            dispatch_tasks = [
                self.dispatch(task) for task in ready_tasks
            ]
            results = await asyncio.gather(*dispatch_tasks, return_exceptions=True)
            # Update state with results
            for task, result in zip(ready_tasks, results):
                if isinstance(result, Exception):
                    await self.handle_worker_failure(task, result)
                else:
                    self.state.complete_task(task.id, result)

        # Phase 3: Synthesize
        return await self.synthesize(self.state.all_results())

    async def plan(self, goal: str) -> Plan:
        """LLM-based dynamic task decomposition."""
        response = await self.llm.complete(
            system=SUPERVISOR_SYSTEM_PROMPT,
            user=f"Goal: {goal}\nAvailable workers: {self.get_worker_manifest()}",
            response_format=PlanSchema  # Structured output
        )
        return Plan.from_llm_response(response)

    async def handle_worker_failure(self, task: Task, error: Exception):
        """Retry transient failures, escalate permanent ones."""
        if task.retry_count < MAX_RETRIES and is_transient(error):
            task.retry_count += 1
            self.state.requeue_task(task)
        else:
            # Ask supervisor LLM whether to fail, skip, or replan
            decision = await self.llm.complete(
                system=FAILURE_HANDLER_PROMPT,
                user=f"Task {task.id} failed: {error}\nCurrent state: {self.state.summary()}"
            )
            await self.apply_failure_decision(decision, task)
Critical Supervisor Design Requirements
The supervisor must checkpoint state after each task completion. If the supervisor crashes mid-execution, it needs to resume from the last checkpoint rather than restart from scratch. A 20-minute workflow that loses 15 minutes of work on crash is a production incident.
The supervisor needs a stall detector. Workers can hang indefinitely on bad inputs or network issues. Set a per-task timeout and have the supervisor treat elapsed-timeout as a failure to trigger the retry/escalation path.
Dynamic replanning is expensive but necessary. When a worker returns an unexpected result that invalidates the original plan, the supervisor needs to replan — not force-fit bad intermediate results into an obsolete plan. Build replanning in as a first-class operation, not an afterthought.
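A minimal checkpoint store satisfying the first requirement above — this sketch assumes a single JSON file per workflow, with an atomic rename so a crash mid-write never leaves a torn checkpoint:

```python
import json
import os

class FileCheckpointStore:
    """Persist completed-task results to disk so a crashed supervisor
    can resume from the last checkpoint instead of restarting."""

    def __init__(self, path: str):
        self.path = path

    def save(self, task_id: str, result: dict) -> None:
        state = self.load_all()
        state[task_id] = result
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, self.path)  # atomic rename: no torn checkpoints

    def load_all(self) -> dict:
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as f:
            return json.load(f)
```

On restart, the supervisor calls `load_all()`, marks those tasks complete in its workflow state, and only dispatches the remainder. For real deployments you'd swap the file for a database, but the atomic-write discipline carries over.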
The reference implementations in OpenAI Agents SDK for production multi-agent orchestration cover the practical SDK patterns. The architecture here describes the conceptual structure that applies across frameworks.
Pipeline Pattern: Sequential Handoffs
The pipeline pattern models a workflow as a sequence of stages, each transforming the output of the previous stage. It's the right pattern when steps have strict ordering dependencies and intermediate outputs are well-defined.
graph LR
IN[Raw Input] --> A
subgraph PIPE["Pipeline: Document Processing Workflow"]
A[Stage 1\nIngestion Agent\nParse + normalize] -->|Structured document| B
B[Stage 2\nExtraction Agent\nEntities + metadata] -->|Extraction result| C
C[Stage 3\nEnrichment Agent\nExternal data lookup] -->|Enriched record| D
D[Stage 4\nAnalysis Agent\nReasoning + scoring] -->|Analysis result| E
E[Stage 5\nFormatting Agent\nOutput generation]
end
E --> OUT[Final Output\nFormatted Report]
subgraph PARALLEL["Parallel Variant"]
P_IN[Input] --> PA & PB & PC
PA[Stage A] --> P_JOIN[Join]
PB[Stage B] --> P_JOIN
PC[Stage C] --> P_JOIN
P_JOIN --> P_OUT[Output]
end
style A fill:#0891b2,color:#fff
style B fill:#0891b2,color:#fff
style C fill:#0891b2,color:#fff
style D fill:#0891b2,color:#fff
style E fill:#0891b2,color:#fff
style PA fill:#16a34a,color:#fff
style PB fill:#16a34a,color:#fff
style PC fill:#16a34a,color:#fff
style P_JOIN fill:#16a34a,color:#fff
Implementing Pipeline Handoffs
The critical implementation detail in pipelines is the handoff schema — the data structure passed from one stage to the next. A loosely typed handoff (passing a raw string or dict) leads to silent data loss and hard-to-debug downstream failures. Define it explicitly:
// Strict typed handoff schema for document pipeline
interface DocumentPipelineHandoff {
  pipeline_id: string;
  stage_completed: string;
  timestamp_utc: string;

  // Accumulated context — grows through pipeline
  original_input_ref: string;      // Reference to raw input in object store
  completed_stages: StageResult[]; // All prior stage outputs (compressed)

  // Current payload — what this stage produced
  payload: {
    content: string;                   // Primary output
    metadata: Record<string, unknown>;
    confidence: number;                // 0-1, used by downstream stages
    warnings: string[];                // Non-fatal issues to propagate
  };

  // Budget tracking
  tokens_used_so_far: number;
  tokens_budget_remaining: number;

  // Error context (non-null if prior stage had partial failure)
  partial_failure?: {
    failed_field: string;
    fallback_used: string;
  };
}
The tokens_budget_remaining field is not cosmetic — it's operational. In a five-stage pipeline where each stage uses 5,000 tokens, you've consumed 20,000 tokens by the time you reach stage 5. Without budget tracking, stage 5 agents will happily consume another 15,000 tokens on verbose analysis that blows your cost model. Pass the remaining budget explicitly so each stage can calibrate its output verbosity.
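A sketch of how a stage might turn tokens_budget_remaining into an output limit. The even-split-with-floor policy is an assumption, not a prescription; the point is that the limit is computed from the shared budget rather than fixed per stage.

```python
def stage_output_limit(tokens_budget_remaining: int, stages_left: int,
                       floor: int = 200) -> int:
    """
    Split the remaining token budget evenly across the stages still to
    run, reserving a minimum floor so late stages are never starved.
    """
    fair_share = tokens_budget_remaining // max(stages_left, 1)
    return max(fair_share, floor)
```

Each stage then passes the limit into its prompt ("respond in at most N tokens") and decrements `tokens_budget_remaining` in the handoff it emits.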
Pipeline vs. Parallel Coordination
The sequential pipeline pattern and the parallel fan-out pattern are often used together: run independent stages in parallel where possible, then merge at synchronization points. The architectural question is where to put the sync barriers.
A practical rule: add a sync barrier anywhere two or more parallel branches produce outputs that must be jointly reasoned about rather than independently appended. Running entity extraction and sentiment analysis in parallel is fine — they produce independent outputs. Running product analysis and competitor analysis in parallel is only valid if the synthesis step can handle receiving both asynchronously; if it needs both to reason correctly, it's a sync barrier.
Hierarchical Multi-Agent Systems
Hierarchical systems are supervisor-worker patterns applied recursively. A top-level supervisor delegates to mid-level supervisors, each of which manages its own worker pool. This mirrors how human organizations work — a CTO delegates to engineering managers, who delegate to engineers.
Hierarchical architectures are justified when:
- No single supervisor can hold the full plan in context (you have 20+ workers)
- Sub-goals are independently meaningful and can be tracked separately
- Different sub-teams need different tool access or authentication scopes
- You want independent failure domains (one sub-team failing shouldn't cascade)
The coordination overhead is real: each level of hierarchy adds latency (supervisor → mid-supervisor → worker → mid-supervisor → supervisor is 4 hops minimum) and context loss (summaries lose information). Only add hierarchy when the scale genuinely demands it.
class HierarchicalOrchestrator:
    """
    Two-level hierarchy: top supervisor delegates to sub-supervisors,
    each managing their own worker pool.
    """
    def __init__(self, sub_supervisors: List[SubSupervisor]):
        self.sub_supervisors = {s.domain: s for s in sub_supervisors}
        self.top_llm = create_llm(model="claude-opus-4", temperature=0)

    async def run(self, goal: str) -> dict:
        # Top-level decomposition: split goal into domain sub-goals
        domain_assignments = await self.decompose_by_domain(goal)

        # Delegate each domain sub-goal to its sub-supervisor
        sub_goals = [
            self.sub_supervisors[domain].run(sub_goal)
            for domain, sub_goal in domain_assignments.items()
        ]

        # Run sub-supervisors in parallel where domains are independent
        sub_results = await asyncio.gather(*sub_goals, return_exceptions=True)

        # Top-level synthesis
        return await self.synthesize_across_domains(
            goal=goal,
            domain_results={
                domain: result
                for domain, result in zip(domain_assignments.keys(), sub_results)
                if not isinstance(result, Exception)
            }
        )
A practical note on context management in hierarchies: each level's supervisor sees only summaries of the levels below it, not raw outputs. Design explicit summarization contracts between levels. When a worker produces a 3,000-token analysis, the mid-supervisor compresses it to a 200-token summary before forwarding to the top supervisor. This is not optional — without it, top supervisors run out of context window on non-trivial workflows.
Latency and Cost Trade-offs in Hierarchical Systems
The common objection to hierarchical architectures is latency. Every level of indirection adds a round-trip — the top supervisor must wait for mid-supervisors, who must wait for workers, who must wait for external tools. In a flat supervisor-worker system, the critical path is supervisor → worker → result. In a two-level hierarchy, it's top_supervisor → mid_supervisor → worker → result → mid_supervisor synthesis → top_supervisor synthesis — at minimum double the latency for the deepest tasks.
The counter-argument is parallelism. A hierarchical system can run sub-teams in parallel across domains, so the wall-clock time is determined by the slowest sub-team, not the sum of all sub-teams. For workflows where sub-domains are genuinely independent (research + legal + financial analysis running concurrently), hierarchical architectures can be faster than flat supervisor-worker, which serializes or limits concurrency through a single supervisor's planning step.
The practical performance comparison comes down to this: flat supervisor-worker wins on raw latency below roughly five workers, while hierarchy wins on wall-clock time only when sub-domains genuinely run concurrently.
The cost picture is different: hierarchical systems consume more tokens per workflow because synthesis happens at every level. A mid-supervisor synthesizing 5 worker outputs into a summary consumes tokens; the top supervisor then synthesizes mid-supervisor summaries, consuming tokens again. Budget 20-40% higher token cost per workflow when moving from flat to two-level hierarchy. That overhead is worth paying when you'd otherwise hit context window limits or single-supervisor throughput ceilings.
The Gartner projection that 40% of enterprise applications will embed task-specific AI agents by end of 2026 makes the hierarchical architecture question urgent — organizations deploying agent systems for the first time are often under-architecting, starting with flat pipelines that don't scale to the workflow complexity they'll encounter in production.
Error Handling in Multi-Agent Systems
Error handling in multi-agent systems is categorically different from single-agent error handling. The differences that matter:
- Partial completion is the common case. In a 10-step pipeline, step 7 can fail while steps 1-6 have already committed state. Retrying from scratch is expensive and may cause double-writes. You need step-level checkpointing and idempotent state transitions.
- Failures cascade. An agent that returns bad output — not an error, just a wrong answer — causes downstream agents to reason on bad premises. By step 10, the compounded error is unrecoverable. You need output validation between stages, not just error catching.
- Retries can cause duplicate side effects. If stage 3 sends an email and then crashes before acknowledging success, retrying stage 3 sends the email twice. All external side effects in pipelines must be behind idempotency keys.
- Timeouts compound. In a 5-agent parallel fan-out with a 30-second timeout on each agent, your worst-case latency is 30 seconds. But in a 5-stage sequential pipeline with 30-second timeouts at each stage, your worst case is 150 seconds. Design timeout budgets for the whole workflow, not individual agents.
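Designing the timeout budget for the whole workflow can be done by threading a single deadline object through every stage instead of giving each stage a fixed timeout — a minimal sketch:

```python
import time

class WorkflowDeadline:
    """One deadline for the whole workflow; each stage asks how much of
    the budget remains instead of using its own fixed timeout."""

    def __init__(self, total_seconds: float):
        self.expires_at = time.monotonic() + total_seconds

    def remaining(self) -> float:
        return max(self.expires_at - time.monotonic(), 0.0)

    def check(self) -> None:
        if self.remaining() == 0.0:
            raise TimeoutError("workflow deadline exceeded")
```

Each stage calls `deadline.check()` before starting and passes `deadline.remaining()` as its own timeout, so a slow early stage automatically shrinks the budget of later ones instead of silently blowing the end-to-end SLO.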
The Error Taxonomy
Treat errors in multi-agent systems as one of four types, with different handling for each:
from enum import Enum

class AgentErrorType(Enum):
    TRANSIENT = "transient"    # Retry: network timeout, rate limit, OOM
    VALIDATION = "validation"  # Fix input: malformed request, missing field
    CAPABILITY = "capability"  # Replan: agent lacks needed skill
    FATAL = "fatal"            # Escalate: auth failure, resource not found

async def handle_agent_error(
    error: Exception,
    task: Task,
    context: WorkflowContext
) -> ErrorAction:
    error_type = classify_error(error)

    if error_type == AgentErrorType.TRANSIENT:
        if task.retry_count < MAX_RETRIES:
            backoff = min(2 ** task.retry_count, 60)  # Cap at 60s
            return ErrorAction(action="retry", delay_seconds=backoff)
        else:
            return ErrorAction(action="escalate", reason="max_retries_exceeded")
    elif error_type == AgentErrorType.VALIDATION:
        # Try to auto-correct the input via LLM
        corrected = await attempt_input_correction(task, error)
        if corrected:
            return ErrorAction(action="retry_with_corrected_input", input=corrected)
        return ErrorAction(action="fail_task", reason="invalid_input")
    elif error_type == AgentErrorType.CAPABILITY:
        # Ask supervisor to replan without this capability
        alternative = await find_alternative_agent(task, context.available_agents)
        if alternative:
            return ErrorAction(action="reroute", agent=alternative)
        return ErrorAction(action="degrade_gracefully", fallback=task.fallback)
    elif error_type == AgentErrorType.FATAL:
        return ErrorAction(action="halt_workflow", preserve_state=True)
Circuit Breakers for Agent Dependencies
When an external service used by one of your agents starts failing, you don't want all agents in your system hammering it with retries. Implement per-dependency circuit breakers:
```python
import time
from collections import deque

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""

class AgentCircuitBreaker:
    def __init__(self, failure_threshold: float = 0.5, window_seconds: int = 60,
                 reset_timeout_seconds: int = 30):
        self.failure_threshold = failure_threshold
        self.window_seconds = window_seconds
        self.reset_timeout_seconds = reset_timeout_seconds
        self._requests: deque = deque()  # (timestamp, succeeded) pairs in the window
        self._open_timestamp = 0.0
        self.state = "closed"  # closed | open | half-open

    async def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self._should_attempt_reset():
                self.state = "half-open"  # let a single probe request through
            else:
                raise CircuitOpenError("Circuit breaker open — failing fast")
        try:
            result = await func(*args, **kwargs)
            self._record(ok=True)
            if self.state == "half-open":
                self.state = "closed"  # probe succeeded, resume normal traffic
            return result
        except Exception:
            self._record(ok=False)
            if self._failure_rate() > self.failure_threshold:
                self.state = "open"
                self._open_timestamp = time.time()
            raise

    def _should_attempt_reset(self) -> bool:
        return time.time() - self._open_timestamp >= self.reset_timeout_seconds

    def _record(self, ok: bool) -> None:
        now = time.time()
        self._requests.append((now, ok))
        while self._requests and self._requests[0][0] < now - self.window_seconds:
            self._requests.popleft()  # evict entries older than the window

    def _failure_rate(self) -> float:
        if not self._requests:
            return 0.0
        return sum(1 for _, ok in self._requests if not ok) / len(self._requests)
```
Circuit breakers on agent dependencies prevent the thundering herd problem: when a model provider has a partial outage, you want to fail fast and return cached/degraded results, not pile up thousands of retry requests that amplify the incident.
Human-in-the-Loop Escalation
Not every multi-agent failure should be handled autonomously. High-stakes or irreversible actions — sending a contract, publishing a pricing change, deleting records — should have explicit escalation gates where a human reviews before the agent proceeds. The architecture challenge: escalation must be asynchronous, not blocking.
Blocking escalation (agent waits synchronously for human approval) ties up resources, makes workflows non-resumable if the review takes hours, and creates a terrible user experience. Asynchronous escalation (agent checkpoints state, emits an escalation request to a review queue, and terminates) lets the workflow resume later when a human approves.
```python
class EscalationManager:
    def __init__(self, review_queue: ReviewQueue, state_store: StateStore):
        self.queue = review_queue
        self.state = state_store

    async def request_review(
        self,
        workflow_id: str,
        pending_action: dict,
        context: dict,
        reason: str
    ) -> str:
        """
        Checkpoint workflow state and emit escalation request.
        Returns escalation_id that resumes the workflow when approved.
        """
        escalation_id = generate_id("esc")

        # Save full workflow state so we can resume
        await self.state.save_checkpoint(
            key=f"escalation:{escalation_id}",
            value={
                "workflow_id": workflow_id,
                "pending_action": pending_action,
                "context": context,
                "created_at": utcnow()
            },
            ttl_hours=72  # Escalations expire after 3 days
        )

        # Push to human review queue
        await self.queue.push({
            "escalation_id": escalation_id,
            "workflow_id": workflow_id,
            "reason": reason,
            "action_summary": summarize_action(pending_action),
            "resume_endpoint": f"/workflows/{workflow_id}/resume/{escalation_id}"
        })
        return escalation_id

    async def resume_after_approval(self, escalation_id: str, approved: bool) -> dict:
        checkpoint = await self.state.get(f"escalation:{escalation_id}")
        if not checkpoint:
            raise EscalationExpiredError(escalation_id)
        if approved:
            return await execute_pending_action(checkpoint["pending_action"])
        else:
            return await skip_with_fallback(checkpoint["workflow_id"])
```
Define escalation triggers explicitly in your system's configuration rather than embedding them in agent prompts. Agent prompts are hard to audit; a configuration-level escalation policy is reviewable, versionable, and auditable. Good escalation triggers: confidence below 0.7 on a consequential classification, action modifying more than N records, first occurrence of a new action type, any action in a restricted domain (legal, financial, PII-adjacent).
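A configuration-level policy can be as simple as a list of named predicates evaluated against each proposed action. The rule names and action fields below are illustrative assumptions, not a standard schema:

```python
# Hypothetical configuration-level escalation policy: each trigger is a named,
# reviewable predicate over the proposed action, kept out of agent prompts.
ESCALATION_POLICY = [
    {"name": "low_confidence",
     "test": lambda a: a.get("confidence", 1.0) < 0.7},
    {"name": "bulk_mutation",
     "test": lambda a: a.get("records_affected", 0) > 100},
    {"name": "restricted_domain",
     "test": lambda a: a.get("domain") in {"legal", "financial", "pii"}},
]

def escalation_triggers(action: dict) -> list[str]:
    """Return the names of every policy rule the action trips (empty list = proceed autonomously)."""
    return [rule["name"] for rule in ESCALATION_POLICY if rule["test"](action)]
```

Because the policy is plain data, it can be versioned in source control and audited independently of any prompt text.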
Observability for Error Diagnosis
Error handling is only as good as your ability to diagnose what went wrong. Three observability investments that pay off immediately in production:
Structured error events — rather than logging a stack trace, emit a structured event with: error_type, agent_id, workflow_id, trace_id, input_token_hash (for reproducibility), step_number, retry_count, and error_category. This lets you aggregate errors by type across workflows, identify agents that fail disproportionately on specific input patterns, and correlate error spikes with upstream provider incidents.
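A sketch of such an event builder, assuming the field names listed above; hashing the input yields a stable reproduction key without persisting possibly sensitive content:

```python
import hashlib
import time

def error_event(*, error_type: str, error_category: str, agent_id: str,
                workflow_id: str, trace_id: str, agent_input: str,
                step_number: int, retry_count: int) -> dict:
    """Build a structured, aggregatable error event instead of logging a raw stack trace."""
    return {
        "error_type": error_type,        # transient | validation | capability | fatal
        "error_category": error_category,
        "agent_id": agent_id,
        "workflow_id": workflow_id,
        "trace_id": trace_id,
        # Hash of the agent's input: lets you group identical failing inputs
        # and reproduce them later without storing raw (possibly PII) content.
        "input_token_hash": hashlib.sha256(agent_input.encode()).hexdigest()[:16],
        "step_number": step_number,
        "retry_count": retry_count,
        "ts": time.time(),
    }
```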
Error rate dashboards per agent — track each agent's error rate independently, not just the overall workflow failure rate. A workflow that uses 8 agents and has a 10% overall failure rate might have one agent causing 80% of failures. Without per-agent breakdowns, you won't find it. Set alert thresholds on per-agent error rates (alert if above 5% over a 5-minute window) rather than only on end-to-end workflow success rate.
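A per-agent breakdown needs nothing more than independent counters keyed by agent ID. This sketch uses cumulative counts for brevity; a production version would evaluate the 5-minute sliding window described above:

```python
from collections import defaultdict

class PerAgentErrorRates:
    """Track error rates per agent so one bad agent isn't hidden in the workflow-level rate.
    Illustrative sketch; class and method names are assumptions."""

    def __init__(self, alert_threshold: float = 0.05):
        self.alert_threshold = alert_threshold
        self._counts = defaultdict(lambda: {"total": 0, "errors": 0})

    def record(self, agent_id: str, ok: bool) -> None:
        c = self._counts[agent_id]
        c["total"] += 1
        c["errors"] += 0 if ok else 1

    def rate(self, agent_id: str) -> float:
        c = self._counts[agent_id]
        return c["errors"] / c["total"] if c["total"] else 0.0

    def agents_over_threshold(self) -> list[str]:
        """Agents that should fire an alert, even if the overall workflow rate looks fine."""
        return [a for a in self._counts if self.rate(a) > self.alert_threshold]
```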
Replay tooling for failed workflows — when a workflow fails at step 7 of 12, you want to replay from step 6 with a corrected agent configuration, not restart from step 1. Build a replay button that re-runs a specific workflow from any checkpoint with optionally different agent configs, model versions, or tool definitions. This is the fastest path from "production incident" to "validated fix" — without it, you're guessing and redeploying into production.
Production Examples
Salesforce Agentforce
Salesforce's Agentforce is one of the most cited production multi-agent systems, having processed over 1 billion autonomous actions in its first six months. The architecture uses a three-tier hierarchy: Einstein Copilot as the top-level orchestrator, domain agents (Sales Coach, Service Agent, Marketing Agent) as mid-level supervisors, and action agents as workers. Every agent action is logged as an explicit step in an auditable workflow — not for compliance after the fact, but as a first-class architectural requirement.
The key engineering decision: they chose orchestrated choreography where domain agents are registered in a central capability registry and discovered dynamically, but execution is supervised with explicit state tracking. The business result — 43% reduction in handle time, 28% improvement in activity logging completeness — came from reliable handoffs between agents, not from any single agent being more capable.
Anthropic's Research Agent Framework
Anthropic's research on multi-agent systems demonstrates a pattern they call "parallelization for tasks requiring multiple independent checks." Their documented finding: tasks that benefit most from parallelization are those where multiple independent analyses need to be cross-checked, not tasks where one agent builds on another's output. This reinforces the swarm-with-voting pattern for verification-heavy workflows like security audits, compliance checking, and medical diagnosis assistance.
Their research also quantifies the context window fragmentation problem in long-horizon tasks: agents operating on tasks requiring 20+ steps experience a 2-3x increase in error rate compared to 5-step tasks, primarily from accumulated context degradation. Hierarchical summarization between stages reduces this error amplification to roughly 1.3x.
LangGraph's Orchestration Primitives
The LangGraph platform has become the de facto implementation layer for Python-based multi-agent systems. Their state graph abstraction maps directly to the patterns here: nodes are agents (or tool calls), edges are conditional transitions, and the graph structure determines whether you're building a pipeline (linear graph), supervisor-worker (hub-and-spoke graph), or hierarchical system (nested sub-graphs). Their built-in checkpointing and human-in-the-loop primitives handle the reliability requirements described in the error handling section.
The practical advantage of LangGraph for the patterns in this article: state management is handled by the framework, so you don't have to implement your own checkpoint store or retry logic from scratch. The tradeoff: the framework adds abstraction overhead that can make debugging harder when things go wrong — understanding what's happening requires understanding the framework's internals, not just your own code.
In the AI agent startup opportunity landscape, these orchestration patterns are increasingly the deciding factor in which agent frameworks win enterprise adoption. The products that implement reliable supervisor-worker with good observability at scale will displace products that only offer pipeline-style sequential execution — complex enterprise workflows don't fit in a linear chain.
CrewAI Multi-Agent Architecture
CrewAI's multi-agent framework implements the crew metaphor: a group of agents with defined roles, goals, and backstories working toward a shared objective. Their Process.sequential maps to the pipeline pattern; Process.hierarchical maps to supervisor-worker. The practical differentiation is their emphasis on agent persona definition — each crew member has an explicit role that shapes how the LLM frames its reasoning, which reduces coordination failures caused by agents "forgetting" their scope.
The AI agents replacing SaaS trend is driving rapid adoption of these frameworks. The pattern to watch: agents built on CrewAI or LangGraph calling external agents via A2A protocol, combining framework-native coordination for owned agents with protocol-level interop for third-party agents.
Architectural Lessons Across All Production Systems
Looking across these production deployments, several engineering patterns emerge consistently as differentiators between systems that scale and those that stall in pilot:
Explicit handoff schemas win over implicit context passing. Every production system that scaled past 5 agents had formalized, versioned data structures for inter-agent communication. Systems that passed raw strings or unstructured JSON between agents accumulated technical debt in parsing logic, failed silently when field names changed, and were nearly impossible to debug when an agent produced unexpected output 8 steps into a workflow.
Observability was built first, not added later. Salesforce's audit-log-as-architecture decision and LangGraph's built-in checkpointing both reflect the same insight: you cannot retroactively add the tracing data you need to debug a production failure. Build tracing infrastructure before your first production deployment, even if the system is small. The cost of adding it later — backfilling trace IDs, updating all agents to emit spans, rebuilding dashboards — exceeds the cost of building it upfront by a significant margin.
Model selection is per-agent, not per-system. Production systems don't use the same model for every agent. Supervisors that need careful planning use frontier models (Claude Opus, GPT-4o) at higher cost. Workers performing well-defined extraction tasks use smaller, faster models (Claude Haiku, GPT-4o-mini) at a fraction of the cost. This per-agent model selection reduces total system cost by 40-70% compared to using a frontier model uniformly, with minimal accuracy impact when agents have bounded, well-defined tasks.
Failure domains are explicit design decisions. The systems that recovered fastest from incidents had deliberately isolated failure domains. An enrichment agent failing shouldn't halt an outreach agent that doesn't depend on enrichment data. Map your dependency graph explicitly, run independent agents in parallel regardless of the pipeline metaphor, and implement dead-letter queues for agent outputs that downstream agents haven't consumed — this is how you build a system where partial failures degrade gracefully instead of cascading.
FAQ
When should I use orchestration vs. choreography?
Start with orchestration. It's easier to debug, easier to reason about, and easier to add monitoring. Migrate to choreography only when you hit genuine scale bottlenecks — either the orchestrator is too slow, the orchestrator is too stateful to scale horizontally, or you need geographic distribution. At under 100 concurrent workflows, orchestration handles load fine and the debuggability advantage is worth the architectural simplicity.
What's the minimum viable A2A implementation?
Expose a /.well-known/agent.json AgentCard and implement POST /tasks/send with GET /tasks/{id} polling. SSE streaming is optional for your first iteration. Authentication via Bearer token is sufficient initially. This three-endpoint surface gives you A2A-compatible agent discoverability without implementing the full spec upfront. Add SSE and push notifications when you have tasks that take longer than 10 seconds to complete.
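The state behind those endpoints can start as a simple in-memory task store; the HTTP layer is whatever framework you already use. This sketch is framework-agnostic, and the class and method names are illustrative:

```python
import uuid

class MinimalA2ATaskStore:
    """In-memory backing for the minimal A2A surface described above:
    the tasks/send handler calls send(), the tasks/{id} handler calls get().
    A sketch only; a real deployment needs durable storage and auth."""

    def __init__(self):
        self._tasks: dict[str, dict] = {}

    def send(self, message: dict) -> dict:
        """Create a task in the 'submitted' state and return it to the caller."""
        task_id = str(uuid.uuid4())
        task = {"id": task_id, "status": "submitted", "message": message, "result": None}
        self._tasks[task_id] = task
        return task

    def get(self, task_id: str) -> dict:
        """Polling endpoint: return current task state."""
        return self._tasks[task_id]

    def complete(self, task_id: str, result: dict) -> None:
        """Called by the agent's own execution loop when the work finishes."""
        self._tasks[task_id].update(status="completed", result=result)
```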
How do I debug a multi-agent failure in production?
The debugging workflow: (1) find the trace ID for the failed workflow in your distributed tracing system, (2) reconstruct the agent call graph for that trace ID, (3) find the first agent that returned an unexpected output (not necessarily the one that threw the error), (4) inspect that agent's full context window input at the time of failure. Step 3 is the hard part — errors surface downstream of the root cause. Always instrument agents to log the first N tokens of their input on error (with PII scrubbing if applicable).
How many agents are too many?
There's no fixed limit, but coordination overhead grows roughly quadratically with agent count in flat architectures. In practice: flat architectures work well up to 4-5 agents, supervisor-worker handles 5-15 workers effectively, hierarchical systems are needed above 15. The more useful constraint is context window budget: each additional agent in a workflow adds its output to the shared context. Above 8-10 agents in a sequential workflow, you'll need explicit context compression or hierarchical summarization to avoid running out of context budget at later stages.
Can A2A and MCP run on the same server?
Yes, and this is a common deployment pattern. An agent server can expose both an MCP tools endpoint (for callers that want specific function calls) and an A2A agent endpoint (for callers that want to delegate entire sub-goals). The same underlying capabilities are exposed at two abstraction levels, letting callers choose the level of autonomy they want to delegate. Route MCP calls to stateless tool handlers and A2A task calls to stateful agent execution — they share the same tool implementations but differ in how they manage state and turn-by-turn reasoning.
What observability should every multi-agent system have?
Minimum viable: distributed trace IDs propagated through every agent handoff, per-agent token consumption logged with trace ID, and error type classification (transient/validation/capability/fatal) in all error logs. Nice to have: per-workflow cost attribution, agent output confidence scores, and context window utilization percentage per agent call. The missing trace ID is the single most common cause of unresolvable production incidents — add it before you need it.
How does the supervisor-worker pattern handle a case where the supervisor itself fails?
The supervisor must checkpoint its own state (current plan, completed tasks, pending tasks) after every state transition — not just worker outputs. If the supervisor crashes, the restart logic reads the last checkpoint, marks in-flight tasks as "needs-verification" (they may or may not have completed before the crash), and re-queues them with idempotency keys to handle potential double-execution. This is the same checkpoint-and-replay pattern used in distributed job systems like Temporal and Apache Flink. Build it from the start, not as an afterthought.
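A sketch of the restart half of that logic, assuming the checkpoint distinguishes pending from in-flight tasks and every task already carries an idempotency key (checkpoint shape is an assumption):

```python
def recover_supervisor(checkpoint: dict) -> list[dict]:
    """Rebuild the supervisor's task queue from its last checkpoint after a crash.

    Pending tasks are simply re-queued. In-flight tasks may or may not have
    completed before the crash, so they are marked 'needs-verification': the
    executor must check the idempotency key against completed work before
    re-running, preventing double execution.
    """
    requeued = []
    for task in checkpoint["pending_tasks"]:
        requeued.append({**task, "status": "queued"})
    for task in checkpoint["in_flight_tasks"]:
        requeued.append({**task, "status": "needs-verification"})
    return requeued
```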
The orchestration patterns covered here — pipeline, swarm, supervisor-worker, hierarchical — are not mutually exclusive. Real production systems compose them: a top-level supervisor that fans out to parallel swarms, where each swarm produces results fed into a synthesis pipeline. The key engineering discipline is making each composition point explicit — a clear handoff schema, a clear error contract, and clear observability at every boundary. Systems that are opaque at their composition points fail silently and recover slowly.
For the product architecture dimension of multi-agent systems — semantic APIs, capability discovery, state management, and the 90-day roadmap to production — see Multi-Agent Orchestration: How to Design Products Where AI Agents Are the Primary Users. For how agent infrastructure is being positioned as the next startup opportunity, see The AI Agent Startup Opportunity.