Session 6: Multi-Agent Systems & Orchestration
The moment your single agent starts hallucinating because it's juggling 30 tools—search, database query, email send, document parse, API call—you've hit the ceiling. One LLM, one context window, one brain trying to be creative and analytical and cautious all at once. It's not a scaling problem. It's an architecture problem.
Senior AI engineers don't fix this by throwing bigger models at it. They split the job. Multi-agent systems are where production AI gets interesting—and where interviews separate candidates who've shipped from those who've read blog posts. If you've built an AI platform at Maersk with centralized LLMs, guardrails, and agent orchestration, you've got stories. This session sharpens them.
Here's the uncomfortable truth: most teams over-engineer. Anthropic's research shows the best agent systems start with the simplest possible solution. Add agents only when you demonstrably need them. This session covers when, why, and how—with the technical depth you need to defend your choices in a room full of skeptics.
1. Why Multi-Agent Systems Exist
Interview Insight: Interviewers want to know you understand the tradeoff—multi-agent adds latency, cost, and complexity. They're testing whether you can justify the architecture, not just diagram it.
Multi-agent systems are like a well-run restaurant kitchen. The head chef doesn't dice onions, grill steaks, and plate desserts. She delegates to specialists who excel at one thing. A single agent, no matter how capable, hits limits: context window exhaustion when holding dozens of documents, tool overload when choosing among 30+ tools, and conflicting responsibilities when one agent must be creative, analytical, and cautious at once.
Divide and conquer. By breaking complex workflows into specialized sub-tasks, each handled by a purpose-built agent, you gain specialization (focused prompts and tools improve accuracy), fault isolation (a researcher failure doesn't kill the whole pipeline), independent scaling (I/O-bound vs compute-bound agents scale differently), and modularity (swap or A/B test agents without redesigning everything).
The complexity spectrum: direct LLM call → single agent with tools → full multi-agent system. Start with the simplest solution. For many apps, retrieval + in-context examples is enough. Agentic systems trade latency and cost for better task performance—use them only when that tradeoff makes sense.
Why This Matters in Production: At Maersk, your email-to-booking pipeline has distinct stages—ingestion, extraction, RAG enrichment, validation—each benefiting from different prompts and tools. A single monolithic agent would drown in context. Split agents let you tune each stage independently.
Aha Moment: The best multi-agent systems often have fewer total LLM calls than a poorly designed single agent that keeps retrying and rethinking. Specialization reduces noise.
2. Architecture Patterns
Interview Insight: They want you to name patterns and defend tradeoffs. "Supervisor" vs "sequential pipeline" isn't trivia—it's a design decision with real implications.
Supervisor / Coordinator Pattern
Analogy: The supervisor is the project manager in a consulting firm. She reads the brief, assigns the right specialist (researcher, analyst, writer), reviews their work, and either sends it back for revision or moves it forward. Exactly one person is in charge—no confusion.
A central orchestrator decomposes tasks, delegates to specialists, evaluates outputs, and synthesizes the final result. Critical rule: exactly one agent must be the orchestrator. Two coordinators = duplicated work, contradictory instructions, race conditions.
Implementation: LLM-based supervisor (model decides next agent dynamically—flexible, context-dependent) or code-based supervisor (deterministic rules, e.g. "if task contains 'research' → researcher"—predictable, debuggable). The supervisor can send work back for refinement—e.g. analyst finds gaps, supervisor routes back to researcher.
Trade-offs: Maximum flexibility; supports iterative refinement. Downside: coordination overhead, every decision flows through one node. Use when task decomposition can't be predetermined.
Sequential Pipeline
Analogy: An assembly line. Car moves from welding → painting → assembly → inspection. Each station depends entirely on the previous one.
Agent A → Agent B → Agent C → done. Research gathers sources, writer drafts, editor refines, fact-checker validates. Simple, linear, debuggable. Downside: high latency (sum of all stages), fragility (one failure halts everything). Use when dependencies are fixed and unavoidable.
Hierarchical Pattern
Analogy: A company with departments. CEO delegates to VP Research and VP Engineering; each VP manages their own team of specialists.
Supervisor → sub-supervisors → agents. Scales to multi-domain problems (research lead manages web-search + document-fetch; engineering lead manages code-writer + test-runner). Downside: coordination and debugging complexity at multiple levels.
Parallel Fan-Out / Fan-In
Supervisor sends tasks to multiple agents simultaneously, collects results. Same task (e.g. three agents each draft a proposal—aggregate via voting) or different tasks (search doc A, B, C in parallel—merge results). Reduces latency when subtasks are independent; higher cost. Use when subtasks don't depend on each other.
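The fan-out/fan-in flow can be sketched with plain asyncio. Everything here is illustrative: search_doc stands in for a real LLM-backed specialist call, and return_exceptions=True gives per-branch fault isolation.

```python
import asyncio

# search_doc is a stand-in for a real I/O-bound agent or tool call.
async def search_doc(doc_id: str) -> dict:
    await asyncio.sleep(0)  # placeholder for network/LLM latency
    return {"doc": doc_id, "findings": f"summary of {doc_id}"}

async def fan_out_fan_in(doc_ids: list[str]) -> list[dict]:
    # Fan-out: launch all independent subtasks concurrently.
    tasks = [asyncio.create_task(search_doc(d)) for d in doc_ids]
    # Fan-in: gather results; exceptions are returned, not raised,
    # so one failed branch doesn't kill the others.
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return [r for r in results if not isinstance(r, BaseException)]

results = asyncio.run(fan_out_fan_in(["doc_a", "doc_b", "doc_c"]))
```

Total latency is roughly the slowest branch, not the sum—the whole point of the pattern.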
Debate / Consensus Pattern
Multiple agents propose, critique, refine. Good for high-stakes decisions. Very expensive. Use sparingly.
Swarm Pattern
Agents self-organize without central control. Each can hand off based on specialization. Works when the problem space is well-partitioned; struggles when global coordination is needed. Rare in production—most use a hybrid: swarm-like handoffs within a domain, supervised by a higher-level orchestrator.
Anthropic's Five Building Blocks (Dec 2024)
Anthropic's "Building Effective Agents" research distills patterns from dozens of production systems. Workflows (LLMs + tools orchestrated by predefined code) vs agents (LLMs dynamically directing tool use)—workflows for predictability, agents for flexibility.
The five patterns: (1) Prompt chaining—decompose into steps, each LLM processes previous output; add programmatic gates to stay on track. (2) Routing—classify input, direct to specialized follow-ups; distinct categories get distinct handling. (3) Parallelization—same task multiple times (voting) or independent subtasks (sectioning) run in parallel. (4) Orchestrator-worker—central LLM dynamically breaks down tasks, delegates, synthesizes; subtasks not pre-defined. (5) Evaluator-optimizer—one LLM generates, another evaluates and gives feedback in a loop; iterative refinement when it provides measurable value.
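The first pattern, prompt chaining with a programmatic gate, fits in a few lines. The llm() stub and the gate condition are placeholders; in a real system the gate might validate JSON structure or check for citations.

```python
# llm() is a stub -- in production each call is a real model invocation.
def llm(prompt: str) -> str:
    return f"[model output for: {prompt}]"

def gate(text: str) -> bool:
    # Programmatic gate: a cheap deterministic check between chain steps,
    # e.g. non-empty and within a length budget.
    return bool(text.strip()) and len(text) < 10_000

def prompt_chain(task: str) -> str:
    outline = llm(f"Write an outline for: {task}")
    if not gate(outline):
        raise ValueError("gate failed after outline step")
    draft = llm(f"Write a draft following this outline: {outline}")
    if not gate(draft):
        raise ValueError("gate failed after draft step")
    return llm(f"Polish this draft: {draft}")

result = prompt_chain("multi-agent systems overview")
```

Each step consumes the previous step's output; the gates catch derailment early instead of letting a bad outline propagate to the final draft.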
Philosophy: Start with the simplest solution. Many patterns = a few lines of code. Frameworks (LangGraph, Vellum, Rivet) can obscure prompts and make debugging harder. Understand what's under the hood before adopting frameworks. The augmented LLM—retrieval, tools, memory—is the foundation; MCP standardizes integrating these.
Why This Matters in Production: Your Maersk platform's orchestration layer is the supervisor. It routes requests to customer support, document processing, or email-booking agents based on intent. The API-based interface gives you a single control plane—exactly one orchestrator.
Aha Moment: LangGraph's conditional edges are just supervisor routing. When you call add_conditional_edges("supervisor", route_fn, {...}), you're implementing the pattern. No magic—just code.
3. Inter-Agent Communication
Interview Insight: How agents pass context is where most production bugs hide. Interviewers probe whether you've hit handoff failures.
Analogy: Teammates passing a baton in a relay. The handoff point is where races happen—or where the baton gets dropped.
Shared state: Central object (e.g. LangGraph state dict) all agents read/write. Simple, explicit. Risk: race conditions, conflicting writes.
Message passing: Structured messages (sender, recipient, payload, metadata). Decouples agents; requires a message bus or routing layer. Good for async, event-driven flows.
Event-driven (pub/sub): Agents subscribe to events ("research_complete"), react when published. Loose coupling; multiple agents can react to same event.
Blackboard pattern: Shared data store; agents read/write independently. No direct agent-to-agent communication. Good for collaborative problem-solving.
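The event-driven style can be sketched with a toy in-process EventBus (a production system would use a real broker such as Kafka or a cloud pub/sub service):

```python
from collections import defaultdict
from typing import Callable

class EventBus:
    """Minimal in-process pub/sub -- a sketch, not a production broker."""
    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable]] = defaultdict(list)

    def subscribe(self, event: str, handler: Callable) -> None:
        self._subscribers[event].append(handler)

    def publish(self, event: str, payload: dict) -> None:
        # Every subscriber reacts; publisher knows nothing about them.
        for handler in self._subscribers[event]:
            handler(payload)

bus = EventBus()
received = []
# Two agents react to the same event -- loose coupling in action.
bus.subscribe("research_complete", lambda p: received.append(("writer", p["topic"])))
bus.subscribe("research_complete", lambda p: received.append(("indexer", p["topic"])))
bus.publish("research_complete", {"topic": "Q3 revenue"})
```

The publisher never names its consumers—adding a third reacting agent requires zero changes to the research agent.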
Production rule: Never pass free-form text. Use explicit, typed handoff schemas: summary, citations, evidence_map, open_questions, confidence_score, metadata. Forces each agent to be explicit about what it found and what's unclear.
Why This Matters in Production: In your email-to-booking flow, the extraction agent hands off to RAG enrichment. If you pass raw text, context is lost. If you pass a typed struct with extracted_fields, confidence, and open_questions, the next agent knows exactly what to work with—and you can debug which stage failed.
Aha Moment: Most "model hallucination" bugs in multi-agent systems are actually handoff bugs. Agent A was right; Agent B never got the right context.
4. MCP (Model Context Protocol)
Interview Insight: MCP is hot. They want to know you understand it and how it fits with A2A. "USB-C for AI" is the soundbite; the architecture is what you'll be quizzed on.
Analogy: MCP is like USB-C—one standard interface. Your laptop doesn't care if you plug in a monitor, a hard drive, or a dock. The MCP client doesn't care if the server is a database, a filesystem, or Sentry. Plug and play.
What it is: Standardized protocol for connecting AI agents to tools and data sources. Anthropic announced it Nov 2024; now under Linux Foundation's Agentic AI Foundation.
Architecture: Host → Client → Server. The MCP Host is your AI app (Cursor, Claude Desktop, your platform). The MCP Client maintains a dedicated connection to a server. The MCP Server provides context and capabilities. One host, many clients; one client per server.
flowchart TB
subgraph Host["MCP Host - AI Application"]
C1[MCP Client 1]
C2[MCP Client 2]
C3[MCP Client 3]
end
S1["MCP Server A - Local - e.g. Filesystem"]
S2["MCP Server B - Local - e.g. Database"]
S3["MCP Server C - Remote - e.g. Sentry"]
C1 ---|"Dedicated connection"| S1
C2 ---|"Dedicated connection"| S2
C3 ---|"Dedicated connection"| S3

Transport: Stdio (local processes, no network) or Streamable HTTP (remote, OAuth, bearer tokens, API keys).
Primitives: Tools—callable functions (model-controlled). Resources—read-only data (application-controlled). Prompts—reusable templates (user-controlled).
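On the wire, a tool invocation is a JSON-RPC 2.0 request. The shape below follows the spec's tools/call method; the tool name and arguments are invented for illustration.

```python
import json

# An MCP tools/call request as it travels client -> server (JSON-RPC 2.0).
# "query_database" and its arguments are hypothetical.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "query_database",
        "arguments": {"sql": "SELECT * FROM bookings LIMIT 5"},
    },
}
wire = json.dumps(request)
```

The same envelope carries tools/list for discovery, which is why any client can enumerate and invoke any compliant server's tools without custom glue.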
Why it matters: One MCP client, any MCP server. New tools become plug-and-play. Your platform's tool access—MCP-compatible servers expose capabilities; agents connect via standardized discovery and invocation.
Why This Matters in Production: Your Maersk platform with centralized LLM access and guardrails likely integrates tools via MCP or a similar abstraction. When a new team wants to add a database tool, they stand up an MCP server—your agent code doesn't change.
Aha Moment: MCP = vertical (agent-to-tool). A2A = horizontal (agent-to-agent). Together they're the full stack. Your orchestrator uses A2A-style handoffs to specialists; each specialist uses MCP to talk to tools.
5. A2A (Agent-to-Agent Protocol)
Interview Insight: A2A is newer; knowing it signals you keep up. The key distinction: MCP for tools, A2A for agents.
What it is: Google's Agent-to-Agent Protocol (April 2025). Standardizes inter-agent communication. 50+ partners (Atlassian, Salesforce, LangChain, MongoDB, etc.).
Core concepts: Agent Cards—JSON metadata (capabilities, input formats, auth, endpoint URLs) so a coordinator can discover the right specialist. Task management—workflows for initiating, progressing, completing. Streaming—SSE for long-running tasks. Transport—JSON-RPC 2.0 over HTTP(S); v0.3 adds gRPC.
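A minimal Agent Card might look like the following. The field values are invented, and while the overall shape (name, url, capabilities, skills) follows the A2A spec, the exact schema varies by protocol version—treat this as a sketch.

```python
import json

# Illustrative A2A Agent Card for a hypothetical pricing specialist.
agent_card = {
    "name": "pricing-agent",
    "description": "Quotes freight prices for a given route and container type.",
    "url": "https://agents.example.com/pricing",   # hypothetical endpoint
    "version": "1.0.0",
    "capabilities": {"streaming": True},
    "skills": [
        {
            "id": "quote_price",
            "name": "Quote price",
            "description": "Return a price quote for a booking request.",
        }
    ],
}
card_json = json.dumps(agent_card)
```

An orchestrator fetches cards like this to pick the right specialist by skill, instead of hardcoding who does what.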
Design: Simplicity (HTTP, JSON-RPC), enterprise-ready (auth, security), async-first, modality-agnostic, opaque execution (agents collaborate without exposing internals).
MCP vs A2A: MCP = agent-to-tool. A2A = agent-to-agent. Orchestrator → A2A → specialist → MCP → tools → results back through A2A.
Why This Matters in Production: As your platform grows, domain teams will want agents to call each other—e.g. a booking agent delegating to a pricing agent. A2A gives you a standard handoff format instead of custom point-to-point integrations. Agent Cards enable discovery: the orchestrator finds the right specialist without hardcoded URLs.
Aha Moment: A2A isn't replacing MCP—they're orthogonal. MCP handles the vertical (agent ↔ tool); A2A handles the horizontal (agent ↔ agent). Your full stack uses both.
6. Task Decomposition & Agent Specialization
Interview Insight: "How does the supervisor decide what to do next?" is a common follow-up. Decomposition strategy matters.
Task decomposition: LLM-based—model analyzes goal, produces subtasks (flexible, non-deterministic). Rule-based—predefined templates (predictable, inflexible). Challenges: completeness, dependencies (B waits for A), knowing when done. Use iteration limits and explicit completion criteria.
Agent specialization: Each agent gets a focused role, tailored system prompt, curated tool set, and sometimes a different model. Researcher: reasoning model + search/fetch tools. Writer: creative model + formatting tools. Critic: evaluator model, no tools. Specialization improves accuracy, enables independent scaling, and isolates faults.
Why This Matters in Production: Your email extraction agent uses one prompt/model; your RAG enrichment agent uses another with retrieval tools. The evaluation pipeline in your platform—dataset management, prompt versions—lets you tune each agent's decomposition and specialization independently. Wrong routing or poor decomposition shows up in evals before production.
Aha Moment: The supervisor's decomposition prompt often matters more than individual agent prompts. Bad decomposition → agents get wrong subtasks → garbage in, garbage out. Fix the orchestrator first.
7. Handoffs and Loop Prevention
Interview Insight: Handoff design and infinite loops are where junior engineers get stuck. You need concrete mitigations.
Handoffs: One agent transfers control and context to another. In LangGraph, handoffs are edges—agent completes, returns state update, conditional edge fires. Context loss: Summary-only handoffs lose nuance; full raw output blows context windows. Structured schemas (summary, citations, confidence, open questions) balance the two.
Loop prevention: A → B → A → … Use iteration limits (hard cap on steps), explicit "done" signals, and cycle detection. Test handoff logic for cycles before production.
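The three mitigations combine into a small guard. This is a sketch: route is any function mapping the current agent to the next one (or None for done), and a repeated handoff pair counts as a cycle; that heuristic would also flag legitimate refinement loops, so tune it to your workflow.

```python
def run_with_loop_guard(start: str, route, max_steps: int = 10) -> list[str]:
    """Walk a handoff chain with a hard iteration cap and cycle detection."""
    path = [start]
    seen_handoffs: set[tuple[str, str]] = set()
    current = start
    for _ in range(max_steps):          # hard cap -- non-negotiable
        nxt = route(current)
        if nxt is None:                 # explicit "done" signal
            return path
        if (current, nxt) in seen_handoffs:
            raise RuntimeError(f"cycle detected: {current} -> {nxt}")
        seen_handoffs.add((current, nxt))
        path.append(nxt)
        current = nxt
    raise RuntimeError("iteration limit reached without completion")
```

The same three ingredients—cap, done signal, cycle check—apply whether the router is a dict, a rules function, or an LLM call.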
8. Production Challenges
Interview Insight: They want war stories. Cost explosion, infinite loops, debugging traces—these are the scars that prove you've shipped.
Race conditions: Multiple agents modifying shared state concurrently. Mitigation: single-writer patterns, or proper concurrency primitives.
Infinite loops: A delegates to B, B back to A. Mitigation: iteration limits, cycle detection, explicit termination.
Cost explosion: 10 agents × 5 steps = 50+ LLM calls per request. Mitigation: semantic caching (up to ~90% reduction), iteration limits, smaller models for routing, per-request budgets.
Error propagation: One failure cascades. Mitigation: validation at handoffs, retries with backoff, fallback paths, circuit breakers.
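Retries with backoff and a circuit breaker can be sketched together. The thresholds and cooldowns here are illustrative, not recommendations.

```python
import time

class CircuitBreaker:
    """Open the circuit after max_failures within window seconds,
    then fail fast until cooldown elapses. A sketch, not hardened code."""
    def __init__(self, max_failures: int = 5, window: float = 60.0,
                 cooldown: float = 30.0) -> None:
        self.max_failures = max_failures
        self.window = window
        self.cooldown = cooldown
        self._failures: list[float] = []
        self._opened_at: float | None = None

    def call(self, fn, *args, retries: int = 3, base_delay: float = 0.0):
        now = time.monotonic()
        if self._opened_at is not None:
            if now - self._opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            self._opened_at = None  # cooldown elapsed; allow a probe
        for attempt in range(retries):
            try:
                return fn(*args)
            except Exception:
                # Keep only failures inside the sliding window.
                self._failures = [t for t in self._failures if now - t < self.window]
                self._failures.append(now)
                if len(self._failures) >= self.max_failures:
                    self._opened_at = now
                    raise RuntimeError("circuit open: too many recent failures")
                time.sleep(base_delay * (2 ** attempt))  # exponential backoff
        raise RuntimeError("retries exhausted")
```

Wrapping each agent call in a breaker turns a flapping downstream dependency into fast, explicit failures instead of a slow cascade.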
Debugging traces: Which agent produced the bad output? Mitigation: structured logging (agent ID, request ID, step), OpenTelemetry, decision traces, replay from checkpoints.
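A structured-logging helper in this spirit emits one JSON line per agent step. The field names here are an assumption, not a standard—the point is that every line carries the request ID, agent, and step.

```python
import json
import logging
import uuid

logger = logging.getLogger("agents")

def log_step(request_id: str, agent: str, step: int, event: str, **fields) -> str:
    """Emit one structured log line per agent step; returns it for inspection."""
    record = {"request_id": request_id, "agent": agent,
              "step": step, "event": event, **fields}
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line

request_id = str(uuid.uuid4())
# Log not just *where* the supervisor routed, but *why*.
line = log_step(request_id, "supervisor", 1, "routed",
                reason="research_data empty", next_agent="researcher")
```

Because every line is machine-parseable and keyed by request_id, reconstructing the full path of a bad run is a query, not an archaeology dig.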
Latency: Sequential calls add up. Mitigation: parallelize, stream for feedback, cache critical path.
Non-determinism squared: Each agent adds randomness. Mitigation: temperature=0 for routing, deterministic fallbacks, idempotency where possible.
Why This Matters in Production: Your MLflow observability tracks cost per request and per agent. That visibility is non-negotiable for multi-agent—without it, you're flying blind when costs spike. Guardrails at the gateway (max requests, content filters) are your first line of defense.
Aha Moment: Human-in-the-loop is a fallback path. When extraction confidence is low, route to human review instead of failing silently. You never lose a booking; you improve the agent over time.
Architecture Diagrams
Supervisor pattern:
flowchart TB
Start([User Request]) --> Supervisor
Supervisor -->|route| Researcher
Supervisor -->|route| Writer
Supervisor -->|route| Analyst
Researcher --> Supervisor
Writer --> Supervisor
Analyst --> Supervisor
Supervisor -->|FINISH| FinalOutput([Final Output])

Sequential pipeline:
flowchart LR
Input([Input]) --> ResearchAgent[Research Agent]
ResearchAgent --> WritingAgent[Writing Agent]
WritingAgent --> ReviewAgent[Review Agent]
ReviewAgent --> Output([Output])

Hierarchical pattern:
flowchart TB
Request([Request]) --> ProjectManager[Project Manager]
ProjectManager --> ResearchLead[Research Lead]
ProjectManager --> EngineeringLead[Engineering Lead]
ResearchLead --> WebSearch[Web Search Agent]
ResearchLead --> DocFetch[Document Fetch Agent]
EngineeringLead --> CodeWriter[Code Writer Agent]
EngineeringLead --> TestRunner[Test Runner Agent]
WebSearch --> ResearchLead
DocFetch --> ResearchLead
CodeWriter --> EngineeringLead
TestRunner --> EngineeringLead
ResearchLead --> ProjectManager
EngineeringLead --> ProjectManager
ProjectManager --> Synthesis([Synthesis])

Code Examples
LangGraph Supervisor Pattern
from typing import TypedDict, Annotated, Literal
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langgraph.prebuilt import create_react_agent
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage, SystemMessage
class AgentState(TypedDict):
messages: Annotated[list, add_messages]
current_agent: str
iteration: int
model = ChatAnthropic(model="claude-sonnet-4-20250514", temperature=0)
# Specialists get focused prompts and tools—no tool sprawl.
# web_search, fetch_document, format_text, add_citations are assumed to be
# tool functions defined elsewhere.
research_agent = create_react_agent(
model=model,
tools=[web_search, fetch_document],
name="researcher",
prompt="You are a research specialist. Find relevant, cited sources."
)
writer_agent = create_react_agent(
model=model,
tools=[format_text, add_citations],
name="writer",
prompt="You are a technical writer. Produce clear, well-structured content."
)
# Supervisor decides next agent dynamically—or FINISH
def supervisor_node(state: AgentState) -> dict:
messages = state["messages"] + [
SystemMessage(content="Decide next: researcher, writer, or FINISH.")
]
response = model.invoke(messages)
next_agent = "researcher" if "research" in response.content.lower() else "writer"
if "finish" in response.content.lower():
next_agent = "__end__"
return {"messages": [response], "current_agent": next_agent, "iteration": state.get("iteration", 0) + 1}
def route_supervisor(state: AgentState) -> Literal["researcher", "writer", "__end__"]:
# Hard cap—prevents infinite loops
if state.get("iteration", 0) >= 10:
return "__end__"
return state.get("current_agent", "researcher")
workflow = StateGraph(AgentState)
workflow.add_node("supervisor", supervisor_node)
workflow.add_node("researcher", research_agent)
workflow.add_node("writer", writer_agent)
workflow.add_edge(START, "supervisor")
workflow.add_conditional_edges("supervisor", route_supervisor, {
"researcher": "researcher", "writer": "writer", "__end__": END
})
workflow.add_edge("researcher", "supervisor")
workflow.add_edge("writer", "supervisor")
graph = workflow.compile()

Typed Handoff Schema
from dataclasses import dataclass
@dataclass
class AgentHandoff:
summary: str
citations: list[str]
open_questions: list[str]
confidence: float
source_agent: str
# Reuses model, SystemMessage, and HumanMessage from the supervisor example;
# parse_intent is a small intent-classification helper assumed defined elsewhere.
def triage_agent(state: dict) -> dict:
"""Routes to specialist based on intent—no free-form text handoffs."""
response = model.invoke([
SystemMessage(content="Classify: billing, technical, or account."),
HumanMessage(content=state["user_message"])
])
intent = parse_intent(response.content)
return {
"messages": state["messages"] + [response],
"handoff": AgentHandoff(
summary=state["user_message"],
citations=[],
open_questions=[],
confidence=0.9,
source_agent="triage"
),
"next_agent": intent # "billing" | "technical" | "account"
}

Sequential Pipeline
workflow = StateGraph(StateSchema)
workflow.add_node("researcher", research_agent)
workflow.add_node("writer", writer_agent)
workflow.add_node("reviewer", reviewer_agent)
# Linear flow—simple, debuggable
workflow.add_edge(START, "researcher")
workflow.add_edge("researcher", "writer")
workflow.add_edge("writer", "reviewer")
workflow.add_edge("reviewer", END)
pipeline = workflow.compile()

Conversational Interview Q&A
"Describe a multi-agent system you built. Why multiple agents instead of one big one?"
Weak answer: "We use multiple agents for better modularity and separation of concerns. Each agent has its own responsibilities."
Strong answer: "At Maersk I built an enterprise AI platform with centralized LLM access, guardrails, and agent orchestration. We chose multiple agents for four reasons. First, specialization—customer support, document processing, and email booking need different prompts and tools. One agent with 50 tools would drown in context and make poor tool choices. Second, fault isolation—when the email extraction agent hits a weird format, it doesn't take down document processing. Third, independent scaling—the booking pipeline is I/O-bound waiting on external APIs; summarization is compute-bound. We scale them differently. Fourth, modularity—teams deploy and iterate on their agents independently. The platform enforces exactly one orchestrator—a supervisor pattern—that routes by intent. No coordination conflicts."
"How do you prevent runaway costs in multi-agent systems?"
Weak answer: "We use iteration limits and try to keep the number of agents low."
Strong answer: "Three levers: iteration limits, semantic caching, and per-request budgets. Every pipeline has a hard cap—10 or 15 steps—then we force termination. Non-negotiable. Semantic caching is huge: we cache responses for similar queries. 'What was Q3 revenue?' and 'Show me third quarter revenue' hit the cache. That can cut costs 80–90% for repeated queries. We track token usage per request and abort on threshold. Routing and simple classification use smaller models; only heavy lifting uses Sonnet. We use MLflow to track cost per request and per agent—when we see spikes, we know exactly which agent and which flow. Guardrails at the gateway limit abuse. We don't guess; we measure."
"Explain MCP. How does it help with tool integration?"
Weak answer: "MCP is a protocol for connecting agents to tools. It standardizes things."
Strong answer: "MCP is like USB-C for AI—one interface, any tool. Host-client-server: the host is your app (our platform), clients connect to MCP servers. Each server exposes tools (callable functions), resources (read-only data), and prompts (templates). Without MCP, every tool is custom integration—database API, error handling, wiring. With MCP, you build one client; any MCP-compatible server plugs in. We add a new data source by standing up an MCP server; agent code doesn't change. Our platform's tool access layer uses this pattern—teams bring their own MCP servers, we provide the gateway, guardrails, and LLM routing."
"What happens when one agent fails? How do you handle it?"
Weak answer: "We retry and sometimes have fallbacks."
Strong answer: "Layered approach. Retries with exponential backoff for transients—timeouts, rate limits—capped at three attempts. Validation at handoff boundaries: before passing to the next agent, we check structure. If extraction returns malformed JSON, we retry with a corrective prompt or fall back. Optional steps can be skipped; critical steps get a simplified fallback—e.g. rule-based extraction if the LLM fails. Circuit breakers: if an agent fails five times in a minute, we open the circuit and fail fast for a cooldown. In the email-to-booking system, low-confidence extractions or validation failures route to human review. We never lose a booking; we improve the agent over time. Every failure is logged with agent ID, step, and input hash—we can trace any run."
"How do you debug a multi-agent system?"
Weak answer: "We add logging and try to trace through the steps."
Strong answer: "Structured tracing is table stakes. OpenTelemetry gives us spans per agent—name, input summary, output summary, tokens, latency. We see the full path: request → supervisor → researcher → supervisor → writer → done. For the supervisor we log why it routed: 'Routed to researcher because research_data empty and user asked about X.' That answers 'why did it go there?' We checkpoint state at key points—if step 7 fails, we replay from step 5. Handoffs use typed schemas, so when the writer produces junk we inspect the researcher's handoff—was context missing? Low confidence? Often the bug is at the handoff. We maintain an eval dataset; when we change prompts we run it and catch regressions. MLflow gives us visibility into cost and behavior per request. Debugging goes from guesswork to systematic replay."
"Compare supervisor vs sequential pipeline. When do you use each?"
Weak answer: "Supervisor is more flexible; sequential is simpler. Use supervisor for complex stuff."
Strong answer: "Supervisor when task decomposition can't be predetermined. A research task might need one round or five—the supervisor examines each specialist's output and decides: more research, move to analysis, or draft. Enables iterative refinement. Tradeoff: coordination overhead, non-deterministic paths, harder to debug. Sequential when the dependency chain is fixed. Research → draft → edit → fact-check—each stage depends entirely on the previous. Simple, predictable, easy to trace. Tradeoff: rigidity, fragility—one failure halts everything. We use a hybrid: sequential main flow (e.g. extraction → enrichment → validation) with supervisor-style logic inside the enrichment stage for how many retrieval rounds to run."
"What's the biggest challenge you faced building multi-agent systems in production?"
Weak answer: "Making sure the agents work together and don't get stuck in loops."
Strong answer: "Handoff failures, not model failures. Agent A produces good output, but it's not formatted for Agent B—or we pass a summary and drop critical nuance. The fix: structured handoff schemas and explicit contracts. We invested in typed handoffs—summary, citations, confidence, open questions—and validation at boundaries. Tuning the supervisor prompt for correct routing had more impact than improving individual agents. Cost control was second: multi-agent multiplies LLM calls. Iteration limits, semantic caching, smaller models for routing, MLflow visibility—we track cost per request and per agent. Third: observability. Without OpenTelemetry and decision traces, debugging was guesswork. Fourth: org alignment. Platform team owns orchestration; domain teams own specialists. Clear contracts on handoffs, error handling, and SLAs let us iterate independently without breaking compatibility."
From Your Experience
1. Platform as Multi-Agent Orchestration
Your Maersk AI platform is a multi-agent orchestration system. You provide centralized LLM access, guardrails, policies, and tool integration. Different teams deploy specialist agents (customer support, document processing, email booking)—each with distinct prompts, tools, and sometimes models. The platform is the infrastructure: LLM gateway, guardrails (content filtering, usage controls), prompt and dataset management, evaluation pipelines. The orchestration layer routes requests by intent and context. How would you describe the "supervisor" in your system—is it LLM-based routing, code-based rules, or a hybrid? What handoff schemas do you enforce between the platform and domain agents?
2. Email-to-Booking Multi-Agent Design
Your email-to-booking system converts unstructured emails into structured bookings with RAG enrichment. Walk through the agent roles: ingestion/parsing (extract text, handle attachments), extraction (pull structured fields, confidence), RAG enrichment (validate and enrich from knowledge bases), validation/synthesis (consistency, final booking). Where does human-in-the-loop plug in? How do you pass context between stages—typed structs or raw text? What happens when RAG is down—do you still extract and flag for manual enrichment?
3. Safeguards You've Implemented
Against infinite loops: iteration limits, explicit completion signals, cycle testing. Against cost explosion: semantic caching, per-request budgets, smaller models for routing, MLflow tracking. How do guardrails at the gateway (max requests, content filters) interact with per-agent policies? What alerted you to cost spikes—and what did you do?
Quick Fire Round
- What's the key benefit of multi-agent over single agent? Specialization, fault isolation, independent scaling, modularity.
- Why exactly one supervisor? Two coordinators cause duplicated work, contradictions, race conditions.
- When do you use a sequential pipeline? When dependencies are fixed and linear; simple, debuggable.
- What's MCP? Agent-to-tool protocol; host-client-server; tools, resources, prompts.
- What's A2A? Agent-to-agent protocol; Agent Cards, task management; complements MCP.
- How do you prevent infinite loops? Iteration limits, explicit "done" signals, cycle detection.
- How do you reduce cost explosion? Semantic caching, iteration limits, smaller models for routing, per-request budgets.
- What's the blackboard pattern? Shared data store; agents read/write; no direct agent-to-agent communication.
- Why structured handoffs? Typed schemas prevent context loss and improve debuggability.
- MCP transport options? Stdio (local), Streamable HTTP (remote).
- What's agent specialization? Focused role, tailored prompt, curated tools, sometimes a different model per agent.
- How do you debug multi-agent systems? OpenTelemetry spans, decision traces, checkpointing, typed handoffs, eval datasets.
- When do you use parallel fan-out? When subtasks are independent; reduces latency.
- Debate/consensus pattern tradeoff? High quality through adversarial refinement; very expensive.
- Human-in-the-loop as fallback? Route low-confidence or validation-failed extractions to humans instead of failing.
Key Takeaways (Cheat Sheet)
| Topic | Key Point |
|---|---|
| Why multi-agent | Single agent hits limits (context, tools, roles). Multi-agent: specialization, fault isolation, scaling, modularity. Start simple. |
| Supervisor pattern | Central orchestrator; exactly one. Decomposes, delegates, synthesizes. LLM or code-based routing. |
| Sequential pipeline | Linear A → B → C. Simple, debuggable. High latency, fragile. Use when deps fixed. |
| Hierarchical | Supervisor → sub-supervisors → agents. Multi-domain. |
| Parallel fan-out | Same/different tasks to multiple agents; merge. Independent subtasks. |
| Inter-agent comm | Shared state, message passing, event-driven, blackboard. Use structured handoffs. |
| MCP | Agent-to-tool. Host → Client → Server. Tools, Resources, Prompts. Stdio or HTTP. |
| A2A | Agent-to-agent. Agent Cards, task management. Complements MCP. |
| Handoffs | Typed: summary, citations, confidence, open questions. No free-form text. |
| Production | Race conditions, loops, cost explosion, error propagation, traces, latency, non-determinism. Limits, caching, validation, observability. |
Further Reading
- Anthropic: Building Effective Agents — Workflows vs agents, five patterns, when to reach for frameworks (and when not to).
- Model Context Protocol — Architecture — Host, client, server, primitives. The official spec.
- Agent2Agent Protocol (A2A) — Google's inter-agent standard; Agent Cards and task management.
- LangGraph Multi-Agent Guide — Supervisor implementation, handoffs, state—hands-on.
- Anthropic: Demystifying Evals for AI Agents — How to evaluate multi-turn agent systems without losing your mind.