AI Engineer Prep

Session 8: Tool Integration, Function Calling & MCP

An LLM without tools is a brain in a jar—it can think, but it can't DO anything. It can wax poetic about the weather in Paris or speculate about your calendar, but it's all hallucination. The moment you give it a get_weather tool, everything changes. Suddenly it can fetch real data, create real bookings, query real databases. That's the power move in AI engineering: tools are how you turn a fancy autocomplete into something that actually matters.

Here's the thing nobody tells you upfront: tool integration is where most agent systems either shine or crumble. Get the schema wrong and the model picks the wrong tool half the time. Skip security and you've got a runaway agent deleting production data. Ignore error handling and your users see "I'm sorry, something went wrong" when the weather API hiccups. Senior AI Engineer interviews drill into this because it's the difference between a demo that wows in a Slack thread and a system that ships to thousands of users.

This session covers function calling (OpenAI and Anthropic), the Model Context Protocol (MCP), schema design, orchestration patterns, security, and error handling. By the end, you'll understand not just how tools work but why certain decisions make or break production systems. And you'll have the Maersk context to back it up.


1. What Is Tool/Function Calling?

Interview Insight: Interviewers want to confirm you understand the fundamental model: the LLM requests, your code executes. They're probing whether you've actually built this loop or just read about it.

Tool calling (also called function calling) is how LLMs request execution of external functions. The model doesn't run code—it outputs structured JSON with a function name and arguments. Your application parses that, runs the corresponding function, and sends the result back. The model then weaves that data into its response.

Think of it like a restaurant: the model is the waiter who takes your order and writes it down. The kitchen (your codebase) actually cooks. The waiter never touches the stove—they just relay requests and bring back the food. That separation is deliberate: the model suggests; you decide what gets executed. No arbitrary code, no untrusted execution. You control the tools, the schemas, and the safety.

Technical flow: User asks "What's the weather in Paris?" → Model receives the prompt + tool list → Model outputs {"name": "get_weather", "arguments": {"location": "Paris"}} → Your code calls get_weather("Paris") → Result goes back to the model → Model produces "The weather in Paris is 22°C."

Why This Matters in Production: Every tool call is a trust boundary. The model's output is untrusted until validated. You must sanitize inputs, validate outputs, and never assume the model will "do the right thing" with sensitive operations.

Aha Moment: The model never sees your source code. It only sees tool names and descriptions. Bad descriptions = wrong tool selection. This is why schema design is half the battle.


2. OpenAI Function Calling

Interview Insight: They want to know you've shipped with OpenAI's API—tools parameter, tool_choice, handling tool_calls, returning function_call_output. The details matter.

OpenAI's function calling uses a tools parameter—a list of tool definitions. Each has type: "function", name, description, and parameters (JSON Schema). The description is in the prompt; it's how the model knows when to call the tool. Vague descriptions = wrong calls.

tool_choice: "auto" (default) lets the model decide. "required" forces at least one tool call. "none" disables tools. You can also force a specific function: {"type": "function", "function": {"name": "get_weather"}}. As of 2025, allowed_tools restricts the model to a subset while keeping the full list for prompt caching.

Response structure: In Chat Completions, a tool-calling turn returns finish_reason: "tool_calls" and a tool_calls array; each entry has an id, a function name, and arguments as a JSON string. Execute the function, then append a role: "tool" message with the matching tool_call_id and your result, and make a second API call with the updated conversation. (The newer Responses API uses function_call items with a call_id and function_call_output items for results; the loop is identical, only the field names differ.)

Parallel tool calls: The model can request multiple tools in one response. Execute them all (concurrently with asyncio.gather), return each with its call_id. Set parallel_tool_calls: false to limit to one per turn. Note: parallel calls aren't supported with built-in tools (web search, MCP).
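
A minimal sketch of that execute-concurrently-and-pair-by-id step (the get_weather stub and the call shapes are illustrative; in practice the calls come from the model's response):

```python
import asyncio
import json

# Hypothetical async tool implementations, keyed by name.
async def get_weather(location: str) -> dict:
    return {"location": location, "temperature": 22}

TOOLS = {"get_weather": get_weather}

async def execute_tool_calls(tool_calls: list[dict]) -> list[dict]:
    """Run all requested tools concurrently, pairing each result with its call_id."""
    async def run_one(call: dict) -> dict:
        fn = TOOLS[call["name"]]
        args = json.loads(call["arguments"])  # OpenAI delivers arguments as a JSON string
        result = await fn(**args)
        return {"type": "function_call_output",
                "call_id": call["call_id"],
                "output": json.dumps(result)}
    # gather preserves input order, so results line up with the requests
    return await asyncio.gather(*(run_one(c) for c in tool_calls))

calls = [
    {"call_id": "call_1", "name": "get_weather", "arguments": '{"location": "Paris"}'},
    {"call_id": "call_2", "name": "get_weather", "arguments": '{"location": "Tokyo"}'},
]
results = asyncio.run(execute_tool_calls(calls))
```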

Strict mode: strict: true enforces exact schema conformance. All properties required (use null for optional), additionalProperties: false. Recommended for production.

Why This Matters in Production: Mismatched call_ids or malformed tool results break the loop. You need robust parsing and error handling so one bad tool doesn't kill the entire agent flow.

Aha Moment: OpenAI's arguments come as a JSON string. You must parse it. Anthropic's Claude gives you parsed objects—different defaults, different gotchas.


3. Anthropic Tool Use

Interview Insight: They're checking if you understand provider differences. Claude uses tool_use blocks and tool_result in user message content—not a separate role.

Anthropic's API is conceptually similar but structurally different. Tools have name, description, and input_schema (not parameters). You get stop_reason: "tool_use" when the model wants to call tools. The content array has text blocks and tool_use blocks. Each tool_use has id, name, and input (already-parsed arguments, not a JSON string).

To return results: add a new user message with tool_result content blocks. Each has tool_use_id (matching the id from tool_use) and content (your result string). All tool_result blocks must match all tool_use blocks—missing IDs cause API errors.

Client vs. server tools: Anthropic distinguishes client tools (you implement) from server tools (Anthropic runs—e.g., web search). Server tools use versioned types like web_search_20250305 and run server-side; you don't handle execution.

Why This Matters in Production: Claude's content array can mix text and tool_use in one message. Your loop logic must handle that—extract tool_use blocks, execute, build tool_result blocks, append, and continue until stop_reason is "end_turn".

Aha Moment: Use input_examples for complex schemas. Anthropic recommends them; they dramatically improve parameter filling for nested or ambiguous structures.


4. Tool Schema Design

Interview Insight: This is where senior engineers stand out. "I wrote clear descriptions" sounds trivial until you've debugged a model calling search when it meant find for the tenth time.

The schema is the model's interface to your tools. No source code, no docstrings—just name, description, and parameter schema. If the model misunderstands, it's usually the schema's fault.

JSON Schema: Use type, properties, required, description, enum, additionalProperties, nested objects. Full expressiveness. Constrain inputs to reduce invalid calls.

Descriptions are everything: "Gets data" is useless. "Retrieves current weather for a given city. Use when the user asks about temperature, conditions, or forecasts." is actionable. Each parameter needs: what it is, what format, any constraints. "City and country, e.g. San Francisco, CA" beats "The location."

Required vs. optional: Mark only truly necessary params as required. Optional = flexibility but also more omission/mis-specification. Use enum for constrained choices.
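
Here's what that looks like in practice: a hypothetical search_sailings schema with tight descriptions, an enum, and only the truly necessary params marked required (all names are illustrative):

```python
# Hypothetical schema for a sailing-search tool; names and values are illustrative.
search_sailings_schema = {
    "name": "search_sailings",
    "description": (
        "Searches scheduled sailings between two ports. Use when the user asks "
        "about departure options. Do NOT use to create or modify a booking."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "origin": {"type": "string", "description": "Origin port, e.g. Rotterdam, NL"},
            "destination": {"type": "string", "description": "Destination port, e.g. Shanghai, CN"},
            "container_type": {
                "type": "string",
                "enum": ["20ft", "40ft", "40ft-hc"],  # constrained choices, not free text
                "description": "Container size. Omit to search all types.",
            },
        },
        "required": ["origin", "destination"],  # only truly necessary params
        "additionalProperties": False,
    },
}
```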

Anti-patterns: Vague descriptions, too many parameters, overlapping tools (search vs find), redundant tools. Fewer tools with clear names beat many ambiguous ones. OpenAI recommends fewer than 20; with 50+, use dynamic loading or routing.

Why This Matters in Production: At Maersk, your booking agent needs tools like create_booking, search_sailings, get_quote. Overlap those and the model confuses create vs. search. Distinct names and explicit "use when" / "do not use when" clauses prevent that.

Aha Moment: Dynamic tool loading: pass only the 5–10 tools relevant to the current task. A calendar agent doesn't need weather. A summarization agent doesn't need delete. Router → subset → model.
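
A toy sketch of that router idea (categories and keyword rules are illustrative; production would use a lightweight classifier rather than keyword matching):

```python
# Registry mapping categories to tool names; all names are illustrative.
TOOL_REGISTRY = {
    "weather": ["get_weather", "get_forecast"],
    "calendar": ["list_events", "create_event"],
    "booking": ["search_sailings", "create_booking", "get_quote"],
}

def route_tools(query: str) -> list[str]:
    """Pick the 5-10 tools relevant to the query instead of passing all of them."""
    q = query.lower()
    if any(w in q for w in ("weather", "forecast", "temperature")):
        return TOOL_REGISTRY["weather"]
    if any(w in q for w in ("meeting", "calendar", "schedule")):
        return TOOL_REGISTRY["calendar"]
    if any(w in q for w in ("booking", "sailing", "quote")):
        return TOOL_REGISTRY["booking"]
    # Fallback: a small default subset, never the whole registry.
    return TOOL_REGISTRY["weather"]
```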


5. MCP (Model Context Protocol)

Interview Insight: MCP is hot. They want to know: What problem does it solve? How does the architecture work? How is it different from LangChain tools?

MCP is like USB-C for AI—one standard plug for everything. Before MCP, every tool integration was custom: weather API, database, email, file system—each with its own API shape, auth, error handling. MCP says: write one server per tool group, any MCP client can discover and use it. Write once, plug in everywhere.

Architecture: the MCP Host (Cursor, Claude Desktop, VS Code) coordinates everything. Each MCP Client connects to a specific server, discovers tools via list_tools(), routes requests, and returns results. The MCP Server exposes tools, resources, and prompts. One Host runs multiple Clients; each Client connects to exactly one Server (1:1).

What servers expose: Tools (callable functions with schemas), Resources (read-only data—files, DB queries), Prompts (reusable templates). Tools = model-controlled. Resources = application-controlled. Prompts = user-controlled.

Transport: Stdio for local (subprocess, stdin/stdout, newline-delimited JSON-RPC). Streamable HTTP for remote (single endpoint, POST + optional GET for streaming). SSE deprecated in favor of Streamable HTTP.

Security: MCP enables powerful capabilities—arbitrary data, code execution. Spec mandates user consent, data privacy, tool safety. Hosts must obtain explicit consent before invoking tools. Treat tool descriptions from untrusted servers as untrusted.

flowchart TB
    subgraph mcpHost["MCP Host e.g. Cursor"]
        mcpClient1[MCP Client 1]
        mcpClient2[MCP Client 2]
    end
    subgraph mcpServers[MCP Servers]
        dbServer["Server: Database"]
        weatherServer["Server: Weather"]
    end
    mcpClient1 -->|"Tools: query, insert"| dbServer
    mcpClient2 -->|"Tools: get_weather"| weatherServer

Why This Matters in Production: On an enterprise platform like Maersk's Agent Platform, you might run multiple MCP servers—booking APIs, scheduling, RAG sources. Stdio for local dev; HTTP for scalable deployment. One protocol, many backends.

Aha Moment: MCP isn't LangChain. LangChain tools are framework-specific. MCP is protocol-agnostic. An MCP server in Python works with Cursor, Claude Desktop, or your custom app. No lock-in.


6. Tool Orchestration Patterns

Interview Insight: They want to see you think in systems—sequential vs. parallel, when to route, when to retry, when to fall back.

Sequential: Search → fetch document → summarize. Each step depends on the previous. Simple, debuggable. Latency adds up.

Parallel: Weather for Paris, London, Tokyo in one turn. Independent tools, fire them all. Both OpenAI and Anthropic support this. Use asyncio.gather.

Conditional: If weather query → get_weather. If stocks → get_stock_price. Routing logic or curated subset.

Iterative: Search returns too many hits → model narrows query → search again. Loop with iteration limits.
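
The iterative pattern reduces to a capped loop. A sketch, with call_model and run_tool as hypothetical stand-ins for your provider call and tool executor:

```python
def agent_loop(messages: list, call_model, run_tool, max_iterations: int = 5):
    """Loop model -> tools -> model until the model stops calling tools,
    with a hard iteration cap so a confused model can't loop forever."""
    for _ in range(max_iterations):
        response = call_model(messages)
        messages.append({"role": "assistant", **response})
        if not response.get("tool_calls"):
            return response["content"]          # model is done
        for call in response["tool_calls"]:
            result = run_tool(call["name"], call["arguments"])
            messages.append({"role": "tool",
                             "tool_call_id": call["id"],
                             "content": result})
    return "Stopped: iteration limit reached."  # cap prevents runaway loops
```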

Fallback: Primary API times out → try backup or cached result. Try/except + fallback registration.

sequenceDiagram
    participant User
    participant App
    participant Model
    participant Tool
    User->>App: "What's the weather in Paris?"
    App->>Model: Prompt + tools list
    Model->>App: tool_call get_weather "Paris"
    App->>Tool: Execute get_weather Paris
    Tool->>App: temp 22 celsius
    App->>Model: tool_result
    Model->>App: "The weather in Paris is 22C."
    App->>User: Final response

Why This Matters in Production: Your email booking agent: extract from email (tool 1) → RAG lookup for sailing options (tool 2) → create booking (tool 3) → human-in-the-loop confirmation. Sequential with a branch. Get the order wrong and you create a booking before you have the right sailing.

Aha Moment: Parallel doesn't mean independent. Sometimes tools have implicit dependencies. Model usually sequences correctly, but if it doesn't, you need ordering logic.


7. Secure Tool Execution

Interview Insight: Security questions separate seniors from juniors. Sandboxing, validation, least privilege, confirmation for destructive ops.

Sandboxing: Run tool code in containers or VMs. A buggy or malicious tool shouldn't compromise the host.

Input validation: Validate before execution. Types, ranges, formats. Sanitize for injection (SQL, shell). Pydantic or similar. Reject invalid before it hits the implementation.

Output validation: Filter PII before passing to the LLM. Validate format. Truncate huge payloads. Error structures → consistent format for the model to interpret.
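
A minimal sketch of that output-sanitization step, assuming regex-based redaction (real deployments would typically use a dedicated PII-detection service, but the shape of the step is the same):

```python
import re

# Illustrative patterns; a production system needs far more thorough coverage.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\+?\d[\d\s-]{8,}\d"), "[PHONE]"),
]
MAX_CHARS = 4000  # truncate huge payloads before they hit the context window

def sanitize_tool_output(text: str) -> str:
    """Redact PII, then cap the size of what the model sees."""
    for pattern, replacement in PII_PATTERNS:
        text = pattern.sub(replacement, text)
    if len(text) > MAX_CHARS:
        text = text[:MAX_CHARS] + "\n[truncated]"
    return text
```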

Destructive operations: Two-step flow. Tool returns "This will delete X. Confirm?" Separate confirmation tool or user approval before execution.
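
A sketch of that two-step flow, with hypothetical delete_records / confirm_delete tools and an in-memory pending-operations store:

```python
import uuid

# Pending destructive operations awaiting confirmation; keyed by token.
PENDING: dict[str, dict] = {}

def delete_records(filter_query: str) -> dict:
    """Step 1: dry-run only. Returns a preview and a token; deletes nothing."""
    token = str(uuid.uuid4())
    PENDING[token] = {"filter": filter_query, "count": 50}  # hypothetical dry-run count
    return {"preview": f"This will delete 50 records matching '{filter_query}'. Confirm?",
            "confirmation_token": token}

def confirm_delete(confirmation_token: str) -> dict:
    """Step 2: executes only with a token from a prior preview call."""
    op = PENDING.pop(confirmation_token, None)
    if op is None:
        return {"error": "Unknown or expired confirmation token."}
    # ... perform the actual deletion here ...
    return {"deleted": op["count"]}
```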

Rate limiting: Per user, per session, per tool. Runaway agents can make thousands of calls.
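
A sliding-window limiter sketch, keyed per (user, tool). The limit and window values are illustrative:

```python
import time
from collections import defaultdict, deque

class ToolRateLimiter:
    """At most `limit` calls per (user, tool) within a `window`-second window."""
    def __init__(self, limit: int = 10, window: float = 60.0):
        self.limit = limit
        self.window = window
        self.calls = defaultdict(deque)  # (user_id, tool_name) -> call timestamps

    def allow(self, user_id: str, tool_name: str) -> bool:
        key = (user_id, tool_name)
        now = time.monotonic()
        q = self.calls[key]
        while q and now - q[0] > self.window:
            q.popleft()                  # drop calls that fell out of the window
        if len(q) >= self.limit:
            return False                 # runaway agent: reject this call
        q.append(now)
        return True
```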

Least privilege: Summarization agent doesn't need delete_database. Dynamic tool loading, permission systems.

Why This Matters in Production: On an enterprise platform with guardrails and policies, tool access is gated. Booking creation might require approval workflows. Deleting records = extra confirmation. You're not just securing the model—you're enforcing business rules.

Aha Moment: Tool descriptions from external MCP servers are untrusted. A malicious server could describe "get_weather" but execute "rm -rf". Users must understand what they're authorizing.


8. Error Handling in Tool Calls

Interview Insight: "What happens when the tool fails?" They want retries, timeouts, fallbacks, and what you tell the model.

Retries + exponential backoff: Transient failures often succeed on retry. 1s, 2s, 4s. Max retry count.

Timeouts: Hard timeouts so a stuck tool doesn't block forever. Document expected latency in tool descriptions.
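
A hard-timeout wrapper sketch using a worker thread (the 10-second default is illustrative). Note that Python can't kill the worker, so a stuck call still occupies a thread until it returns:

```python
import concurrent.futures

# Shared pool; a stuck tool occupies a worker until it eventually returns.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def call_with_timeout(fn, *args, timeout: float = 10.0, **kwargs):
    """Run a tool with a hard timeout; on expiry, return a structured error
    the model can act on instead of blocking the agent loop."""
    future = _pool.submit(fn, *args, **kwargs)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        return {"error": f"Tool timed out after {timeout}s.",
                "suggestion": "Retry or use a fallback source."}
```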

Fallback: Backup API, cached result, graceful degradation. "I couldn't fetch live data; here's what I know."

Error messages to the LLM: Descriptive. "API rate limit exceeded; try again in 60 seconds" beats "Error 429." Model can retry, suggest waiting, or try a different approach.

Circuit breaker: N consecutive failures → stop calling temporarily. Cooldown. Prevents cascading when downstream is down.
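
A minimal circuit-breaker sketch (the threshold and cooldown values are illustrative):

```python
import time

class CircuitBreaker:
    """After `threshold` consecutive failures, skip the tool for `cooldown` seconds."""
    def __init__(self, threshold: int = 3, cooldown: float = 30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # set when the circuit opens

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                return {"error": "Circuit open; tool temporarily disabled.",
                        "suggestion": f"Retry after cooldown ({self.cooldown}s)."}
            self.opened_at = None  # cooldown elapsed: allow a probe call
        try:
            result = fn(*args, **kwargs)
        except Exception as e:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            return {"error": str(e), "suggestion": "Transient failure; may retry."}
        self.failures = 0  # any success resets the count
        return result
```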

flowchart TD
    toolCall[Tool Call Request] --> execute{Execute Tool}
    execute -->|Success| returnResult[Return Result to Model]
    execute -->|Failure| retryCheck{Retry?}
    retryCheck -->|Yes attempts left| backoff[Exponential Backoff]
    backoff --> execute
    retryCheck -->|No| fallbackCheck{Fallback Available?}
    fallbackCheck -->|Yes| tryFallback[Try Fallback Tool]
    tryFallback --> execute
    fallbackCheck -->|No| returnError[Return Error to Model]
    returnError --> modelAdjusts[Model Adjusts Strategy]

Why This Matters in Production: Booking APIs flake. RAG services timeout. You need the agent to say "I couldn't complete the booking—please try again or contact support" instead of crashing. And you need observability (MLflow, logs) to know when failure rates spike.

Aha Moment: Return errors in a format the model can act on. Structure matters. "Error: timeout. Suggestion: retry in 30s or use manual booking link."


9. Tool Selection Strategies

Interview Insight: "The model keeps picking the wrong tool." Classic scaling problem. How do you fix it?

With 50+ tools, models get confused. Descriptions blur. Parameter schemas add cognitive load.

Solutions:
1. Dynamic tool selection: only pass relevant tools.
2. Better descriptions: distinct, explicit "when to use" / "when not to use."
3. Consolidate overlapping tools: one search_docs, not search and find.
4. Hierarchical routing: a classifier picks the category, then pass that subset.
5. Few-shot examples in the system prompt.
6. Fine-tuning for function calling (OpenAI supports this).
7. Monitor and iterate: log tool→query pairs, refine descriptions from misselection patterns.

Why This Matters in Production: On a platform with many agents and many tools, you need a tool registry that can filter by agent type, permission, and context. Not every agent sees every tool.

Aha Moment: OpenAI recommends <20 tools per request. If you have 50, you're doing it wrong—or you need a router in front.


Code Examples

OpenAI Function Calling: Complete Flow

from openai import OpenAI
import json
 
client = OpenAI()
 
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Retrieves current weather for the given location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City and country, e.g. Paris, France"},
                    "units": {"type": "string", "enum": ["celsius", "fahrenheit"], "description": "Temperature units."}
                },
                "required": ["location", "units"],
                "additionalProperties": False
            },
            "strict": True
        }
    }
]
 
def get_weather(location: str, units: str) -> dict:
    # In production, call a real weather API
    return {"temperature": 22, "conditions": "sunny", "unit": units}
 
messages = [{"role": "user", "content": "What's the weather in Paris in Celsius?"}]
response = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools, tool_choice="auto")
assistant_message = response.choices[0].message
 
if assistant_message.tool_calls:
    messages.append(assistant_message)
    for tool_call in assistant_message.tool_calls:
        if tool_call.function.name == "get_weather":
            args = json.loads(tool_call.function.arguments)
            result = get_weather(args["location"], args["units"])
            messages.append({"role": "tool", "tool_call_id": tool_call.id, "content": json.dumps(result)})
    final_response = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
    print(final_response.choices[0].message.content)
else:
    print(assistant_message.content)

Anthropic Tool Use: Complete Flow

import anthropic
client = anthropic.Anthropic()
 
tools = [{
    "name": "get_weather",
    "description": "Get the current weather in a given location",
    "input_schema": {
        "type": "object",
        "properties": {
            "location": {"type": "string", "description": "City and state, e.g. San Francisco, CA"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"], "description": "Temperature unit"}
        },
        "required": ["location"]
    }
}]
 
def get_weather(location: str, unit: str = "celsius") -> str:
    return f"22°{unit[0].upper()}, sunny"
 
messages = [{"role": "user", "content": "What's the weather in San Francisco?"}]
response = client.messages.create(
    model="claude-sonnet-4-20250514", max_tokens=1024, tools=tools, messages=messages
)

while response.stop_reason == "tool_use":
    tool_use_blocks = [b for b in response.content if b.type == "tool_use"]
    tool_results = []
    for block in tool_use_blocks:
        if block.name == "get_weather":
            result = get_weather(block.input["location"], block.input.get("unit", "celsius"))
            tool_results.append({"type": "tool_result", "tool_use_id": block.id, "content": result})
    # Keep the full conversation: the assistant turn (with its tool_use blocks),
    # then a user turn carrying the matching tool_result blocks.
    messages.append({"role": "assistant", "content": response.content})
    messages.append({"role": "user", "content": tool_results})
    response = client.messages.create(
        model="claude-sonnet-4-20250514", max_tokens=1024, tools=tools, messages=messages
    )
print(response.content[0].text)

Building an MCP Server with FastMCP

from fastmcp import FastMCP
mcp = FastMCP("Weather Server")
 
@mcp.tool()
def get_weather(location: str, unit: str = "celsius") -> str:
    """Get the current weather for a location.
    Args:
        location: City and country, e.g. Paris, France
        unit: Temperature unit: celsius or fahrenheit
    """
    return f"Weather in {location}: 22°{unit[0].upper()}, sunny"
 
@mcp.tool()
def get_forecast(location: str, days: int = 3) -> str:
    """Get a multi-day forecast for a location."""
    return f"Forecast for {location}: Sunny for the next {days} days"
 
if __name__ == "__main__":
    mcp.run()

MCP Client: Discover and Call Tools

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client
import asyncio
 
async def main():
    server_params = StdioServerParameters(command="python", args=["weather_server.py"])
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            for tool in tools.tools:
                print(f"Tool: {tool.name} - {tool.description}")
            result = await session.call_tool("get_weather", {"location": "Paris, France", "unit": "celsius"})
            print(result.content)
 
asyncio.run(main())

Tool Error Handling with Retries

import time
from typing import Callable, Any
 
def with_retry(fn: Callable, max_retries: int = 3, base_delay: float = 1.0, backoff: float = 2.0) -> Callable:
    def wrapper(*args, **kwargs) -> Any:
        last_error = None
        for attempt in range(max_retries):
            try:
                return fn(*args, **kwargs)
            except Exception as e:
                last_error = e
                if attempt < max_retries - 1:
                    time.sleep(base_delay * (backoff ** attempt))
        raise last_error
    return wrapper

Input Validation with Pydantic

from pydantic import BaseModel, Field
 
class GetWeatherInput(BaseModel):
    location: str = Field(..., min_length=1, max_length=200)
    unit: str = Field(default="celsius", pattern="^(celsius|fahrenheit)$")
 
def get_weather_validated(location: str, unit: str) -> dict:
    try:
        validated = GetWeatherInput(location=location, unit=unit)
    except Exception as e:
        return {"error": str(e), "suggestion": "Use format: city, country"}
    return {"temperature": 22, "unit": validated.unit}

Architecture: Tool Call Lifecycle

flowchart LR
    userMsg[User Message] --> modelIn[Model Input]
    modelIn --> modelOut[Model Output]
    modelOut --> hasTools{Tool Calls?}
    hasTools -->|Yes| execute[Execute Tools]
    execute --> append[Append Results]
    append --> modelIn
    hasTools -->|No| final[Final Response]

Conversational Interview Q&A

Q1: "How do you handle a tool call that fails or times out in production?"

Weak answer: "We catch the exception and return an error message to the user."

Strong answer: "Layered approach. Hard timeouts—10–30 seconds per tool—so a stuck call doesn't block the agent. Retries with exponential backoff for transient failures: 2–3 retries, 1s/2s/4s. I return structured error messages to the LLM—'Weather API timed out after 10s'—so the model can inform the user or try an alternative. For critical tools like booking creation, we have fallbacks: cached options or 'please try manual booking.' Circuit breakers after N consecutive failures so we don't hammer a down service. At Maersk, our email booking agent logs failures to MLflow—when the booking API starts failing, we see it immediately and can alert."


Q2: "Explain MCP architecture. How is it different from LangChain tools?"

Weak answer: "MCP is a protocol for tools. LangChain has tools too. They're similar."

Strong answer: "MCP has three layers: Host (Cursor, Claude Desktop), Client (connects to a server), Server (exposes tools, resources, prompts). Clients discover via list_tools(), execute via call_tool(). JSON-RPC over stdio or Streamable HTTP. LangChain tools are framework-specific—they work in the LangChain ecosystem. MCP is protocol-agnostic. Write a server once, any host that speaks MCP can use it. No lock-in. MCP also standardizes resources and prompts, which LangChain doesn't. On our Agent Platform at Maersk, we're evaluating MCP for standardizing tool access across agents—one server per capability area, pluggable into different runtimes."


Q3: "How do you validate tool outputs before passing them back to the LLM?"

Weak answer: "We check if it's valid JSON."

Strong answer: "Schema validation with Pydantic—structure and types. PII filtering: scan for emails, phone numbers, SSNs, redact before sending to the model. Size limits—truncate or summarize long outputs. Format normalization so the model gets consumable data. Error structures formatted consistently: 'Error: [message]. Suggestion: [action].' In compliance-heavy environments we log what was sent for audit. Our booking agent returns customer and sailing data—we filter any PII that shouldn't be in the prompt and cap response length so we stay within context limits."


Q4: "An agent has 50 tools. The model keeps picking the wrong one. How do you fix it?"

Weak answer: "Improve the descriptions."

Strong answer: "Dynamic tool loading first—only pass relevant tools. Use a lightweight classifier: is this a booking query? Calendar? Search? Pass 5–10 tools max. Second, make descriptions distinct: 'Use when the user wants to create a new booking' vs 'Use when the user wants to search existing bookings.' Consolidate overlapping tools—one search, not search and find. Hierarchical routing: first call classifies task, second gets only that category's tools. Few-shot examples in the system prompt. Monitor tool→query pairs, iterate on descriptions from misselections. At Maersk we have create_booking, search_sailings, get_quote—very different names, explicit 'when to use' so the model doesn't confuse them."


Q5: "How do you secure tool execution? What if a tool has destructive side effects?"

Weak answer: "We validate inputs and don't give delete access."

Strong answer: "Least privilege—agents get only tools they need. Input validation: schema, ranges, sanitization for injection. For destructive ops—DB deletes, file overwrites—two-step confirmation: tool returns preview, 'This will delete 50 records. Confirm?' Separate confirmation mechanism before execution. Sandbox high-risk tools in containers. Rate limit per user/session. Audit and log all invocations. Treat tool descriptions from external MCP servers as untrusted. On our platform we gate booking creation behind human-in-the-loop; destructive operations aren't even exposed to most agents."


Q6: "Walk through building an MCP server. What design decisions matter?"

Weak answer: "Use FastMCP, add tools with decorators, run it."

Strong answer: "Transport first: stdio for local (Cursor, CLI) or Streamable HTTP for remote (scalable, multi-client). Tool granularity: one per operation vs. one with a type parameter—finer gives more control, coarser reduces count. Schema design: JSON Schema, clear descriptions, enums for choices. Error handling: structured errors clients can pass to the LLM. Resources vs. tools: read-only data as resources (client-fetched), actions as tools (model-initiated). Auth for HTTP: API keys, OAuth. State: stateless vs. session state. At Maersk we'd build one server for booking tools, one for RAG—stdio for dev, HTTP for prod. Test with MCP inspector before production."


Q7: "How do you handle parallel tool calls? What challenges arise?"

Weak answer: "We run them in parallel with asyncio."

Strong answer: "asyncio.gather for concurrent execution. Collect all results, return each with correct call_id or tool_use_id. Challenges: ordering—results must match request IDs; partial failure—return what succeeded, error for failures so model can adapt; per-tool timeouts so one hang doesn't block others; implicit dependencies—model usually sequences, but sometimes we need ordering logic; rate limits—parallel can hit faster, may need throttling. At Maersk our booking flow is mostly sequential by design, but for things like fetching sailing options and customer profile we could parallelize. Note: OpenAI says parallel_tool_calls isn't supported with built-in/MCP tools in some configs."


From Your Experience (Maersk-Tailored Prompts)

1. Your platform provides "tools access" to agents. Walk through the architecture.

Describe: how tools are defined (schema format), how they're registered, how the agent runtime discovers and invokes them, how results flow back. Mention ToolRegistry or base abstractions. How do you support both OpenAI and Anthropic tool formats? How does tool access integrate with your guardrails and policies?

2. How does the email booking automation agent handle tool failures?

Your agent extracts info from emails, uses RAG, creates bookings, has human-in-the-loop. When create_booking fails (API down, validation error), what happens? Retries? Fallback to manual? Error message to user? Circuit breaker or alerting? How do you log and observe these in MLflow?

3. Would you use MCP for the Agent Platform, or a custom protocol? Why?

You have centralized LLM models, guardrails, tool access, observability. How would MCP fit? Stdio for local tools, HTTP for deployment? One server per capability vs. one monolithic server? What would a custom protocol need that MCP doesn't provide?


Quick Fire Round

Q: What does the model actually output when it wants to call a tool?
A: Structured JSON: function name + arguments. Not executable code.

Q: What's the difference between OpenAI's arguments and Anthropic's input?
A: OpenAI: JSON string. Anthropic: parsed object. You must parse OpenAI's.

Q: When should you use tool_choice: "required"?
A: When you want to force the model to use at least one tool—e.g., you need data before answering.

Q: What's MCP's main advantage over custom integrations?
A: Write once, plug anywhere. Protocol-agnostic. Any MCP client can use any MCP server.

Q: Stdio vs. Streamable HTTP for MCP?
A: Stdio: local, subprocess, lowest latency. HTTP: remote, scalable, auth, multi-client.

Q: How many tools should you pass per request?
A: Fewer than 20 (OpenAI's guidance). With 50+, use dynamic loading or routing.

Q: What's the circuit breaker pattern for tools?
A: After N consecutive failures, stop calling for a cooldown. Re-enable later. Prevents cascading failures.

Q: Why are tool descriptions critical?
A: Model has no access to source code. Descriptions are the only way it knows when and how to use a tool.

Q: What's strict mode in OpenAI function calling?
A: strict: true enforces exact schema match. Required fields, additionalProperties: false. Recommended for production.

Q: How do you handle parallel tool calls when one fails?
A: Return results for successes, structured error for the failed one. Model can adapt. Don't block all on one failure.

Q: What are MCP resources vs. tools?
A: Resources: read-only, client fetches. Tools: model-initiated, callable functions. Different control flows.

Q: What's the two-step confirmation for destructive operations?
A: Tool returns preview ("This will delete X. Confirm?"). Separate confirmation mechanism before actual execution.

Q: What's allowed_tools in OpenAI (2025)?
A: Restrict model to a subset of tools while keeping full list for prompt caching.


Key Takeaways (Cheat Sheet)

Tool calling: Model requests; your code executes. Request–response protocol.
Schema design: Clear descriptions, fewer tools, distinct names. Descriptions are the model's only interface.
OpenAI vs Anthropic: OpenAI uses tool_calls and function_call_output with arguments as a JSON string; Anthropic uses tool_use and tool_result in user content, with input already parsed.
MCP: Protocol-agnostic. Host–Client–Server. Tools, resources, prompts. Stdio or Streamable HTTP.
Security: Input/output validation, sandboxing, least privilege, confirmation for destructive ops, rate limiting.
Error handling: Retries + backoff, timeouts, fallbacks, descriptive errors to LLM, circuit breaker.
Tool selection: Fewer than 20 tools per request. Dynamic loading, routing, better descriptions at scale.
Parallel calls: Execute concurrently, return each result with its correct ID. Handle partial failure, timeouts, rate limits.

Further Reading