Session 11: LLMOps, Observability & Cost Management
Your agent just went rogue and burned through $500 in API calls in 10 minutes. How long before someone noticed? If your answer is "we'd check the bill at the end of the month"—you need this session.
Traditional monitoring gives you HTTP 200s and request counts. Congratulations: your system is "up." But your LLM is hallucinating refund policies, your RAG retriever is pulling irrelevant docs, and your agent is calling the same tool twelve times per request. Nobody's paging you, because technically nothing "failed." That's the illusion. In production AI systems, the difference between "working" and "catastrophically wrong" is invisible without the right observability stack.
This session covers what actually matters: tracing every step of a request, tracking costs before they surprise you, versioning prompts so you can roll back in seconds, and responding to incidents when quality tanks. We'll ground it in real tools—MLflow, LangSmith, Phoenix, OpenTelemetry—and in what we learned building AI agents at scale. No fluff. Let's go.
1. What is LLMOps
Interview Insight: Interviewers want to know you understand that LLMOps isn't just "MLOps with bigger models." The whole mental model shifts: you're not training and deploying artifacts; you're orchestrating prompts, retrieval, and API calls. Show you get the nuance.
Analogy: Think of LLMOps like running a restaurant kitchen. MLOps is the head chef training sous chefs and shipping recipes. LLMOps is managing a kitchen that orders pre-made dishes from external vendors (the API). Your "recipe" is the prompt, the retrieval layer, and the tool integrations. You don't control how the vendor makes the dish—you control what you order, how you combine it, and how you serve it. And every dish can taste slightly different (non-determinism), so "did it come out right?" is a harder question than "did it arrive?"
LLMOps is the operational discipline for deploying and maintaining LLM applications in production. It evolved from MLOps but differs fundamentally. In traditional MLOps, you train models on your data, version them, and deploy artifacts. With LLMs, you typically consume models via API. The "model" in LLMOps is the entire system: the prompt, retrieval layer, orchestration logic, and tool integrations. Evaluation is harder because outputs are non-deterministic—same input, different outputs across runs. You can't rely on exact-match metrics; you need semantic and probabilistic quality measures.
The core pillars: observability (what happened?), evaluation (was it good?), prompt management (what version produced this?), cost management (how much did we spend?), and deployment (how do we ship and scale?). LLM apps face unique challenges: token-by-token generation with unpredictable latency, volatile costs, probabilistic outputs, and prompts that behave like code—small changes, big behavior shifts.
Why This Matters in Production: Without LLMOps discipline, you're flying blind. You ship an agent, it "works" in staging, and then production usage explodes costs, or quality drifts, or a provider updates their model and your prompts break. You need the full lifecycle: data management, evaluation beyond traditional metrics, version control for prompts, and continuous monitoring.
Aha Moment: The "model" isn't the base LLM—it's the whole stack. When something goes wrong, it could be the prompt, the retriever, the tool, or the provider. Observability lets you see which.
2. Observability for LLM Applications
Interview Insight: Expect "How do you monitor LLM applications?" The weak answer is "we check error rates." The strong answer names traces, metrics, and logs, and explains why HTTP 200 isn't enough.
Analogy: A hospital monitors vital signs (heart rate, blood pressure). But that doesn't tell you if the patient is getting better. You need the full chart: what medication was given, when, how the patient responded. LLM observability is the same: status codes tell you the request completed; traces tell you what completed—retrieval, LLM call, tool call—and whether each step was correct.
Traditional monitoring—HTTP status codes, request counts, error rates—is insufficient. An HTTP 200 does not mean the answer is correct. The model could have hallucinated, retrieved garbage context, or produced a harmful response. You need to observe quality, not just availability.
The three pillars: traces, metrics, and logs. Traces show the execution path: request → retriever → LLM → tool → LLM again → response. Each step is a span with parent-child relationships. You see exactly where latency spiked or where something broke. Metrics aggregate: latency percentiles (P50, P95, P99), token usage, cost per request, quality scores (faithfulness, relevancy), error rates. Logs capture raw inputs/outputs—prompt sent, completion received, retrieved docs. Be careful: logs may contain PII; use masking or sampling in production.
What to observe: every LLM call (prompt, completion, tokens, latency, cost), every retrieval (query, results, relevance scores), every tool call (input, output, errors), and the end-to-end request. In multi-agent systems, observe the full graph—which agent ran, what it passed, where the chain broke.
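The span structure described above can be sketched as a tiny tree: the root span is the request, each agent or step is a child, and attributes carry what you observe. This is an illustrative data model, not a real tracing SDK:

```python
# Sketch of a trace as a span tree. Names and attributes are illustrative;
# real tracing libraries (OTel, MLflow) manage this structure for you.
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str                 # e.g. "retrieval", "llm_call", "tool_call"
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    def add_child(self, child: "Span") -> "Span":
        self.children.append(child)
        return child

# Root span = the incoming request; each step is a child span.
root = Span("request", {"request_id": "req-123"})
root.add_child(Span("retrieval", {"query": "refund policy", "top_k": 5}))
llm = root.add_child(Span("llm_call", {"model": "gpt-4o", "input_tokens": 820}))
llm.add_child(Span("tool_call", {"tool": "lookup_order"}))

def walk(span, depth=0):
    """Flatten the tree for inspection as (depth, name) pairs."""
    yield depth, span.name
    for child in span.children:
        yield from walk(child, depth + 1)

print(list(walk(root)))
# [(0, 'request'), (1, 'retrieval'), (1, 'llm_call'), (2, 'tool_call')]
```

Walking the tree is exactly what a trace viewer does: it shows which step was slow or wrong, and where in the hierarchy it sat.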
Why This Matters in Production: You cannot fix what you cannot see. A user reports "the agent gave wrong info." Without traces, you're guessing. With traces, you pinpoint: bad retrieval? Bad prompt? Model hallucination?
Aha Moment: Production-grade LLM observability combines automatic instrumentation (OpenTelemetry, framework tools) with custom metrics for business-specific quality. The key insight: availability ≠ correctness.
3. MLflow for LLM Tracking
Interview Insight: "What observability tools have you used?" MLflow is a strong answer—open source, OpenTelemetry-compatible, works with 20+ GenAI libraries. Bonus: mention mlflow-tracing for production (lightweight footprint).
Analogy: MLflow is like a flight recorder for your LLM pipeline. Every step gets logged—inputs, outputs, timing—so when something goes wrong, you rewind and see exactly what happened. No vendor lock-in; the data stays on your infrastructure.
MLflow Tracing is a fully OpenTelemetry-compatible observability solution for GenAI apps. It captures inputs, outputs, and metadata for each step of a request. Use cases: debugging in IDEs/notebooks, production monitoring (latency, token usage), evaluation, human feedback collection, and dataset creation from production traffic.
One-line autolog: mlflow.openai.autolog(), mlflow.anthropic.autolog(), and similar for LangChain, LlamaIndex, DSPy, Vercel AI, and 20+ libraries. For production, use the lightweight mlflow-tracing package (95% smaller than full MLflow) with async logging so trace collection doesn't block request handling. MLflow Evaluate adds built-in metrics (faithfulness, relevancy, toxicity) and custom evaluators. Experiment tracking logs parameters (model, temperature, prompt version), metrics (quality, latency, cost), and artifacts.
Why This Matters in Production: At scale, you need trace data without killing performance. The mlflow-tracing package gives you production-grade observability with minimal overhead. Your trace data never leaves your infra.
Aha Moment: MLflow isn't just for experiments. It's a full lifecycle platform: trace in prod, evaluate on traces, version prompts as "models," run regressions. One tool, end to end.
4. LangSmith
Interview Insight: If you're on LangChain/LangGraph, LangSmith is the path of least resistance. Zero code changes beyond env vars. Know the tradeoff: managed vs. self-hosted for data residency.
Analogy: LangSmith is like having a dedicated scribe for your LangChain app. Set one env var, and every call gets logged automatically—no instrumentation code to maintain. It's the "batteries included" option for the LangChain ecosystem.
LangSmith is LangChain's unified observability and evaluation platform. It works with any LLM framework—OpenAI SDK, Anthropic, Vercel AI, LlamaIndex—but has the tightest integration with LangChain and LangGraph. Automatic trace capture: set LANGCHAIN_TRACING_V2=true and your API key. Every model call, memory access, and tool action is captured.
Token usage and costs break down into Input (text, cache reads, image tokens), Output (text, reasoning, image tokens), and Other (tool calls, retrieval). LangSmith auto-records for major providers and supports usage_metadata for custom models. Evaluation: datasets, evaluators (LLM-as-judge or custom Python), and as of 2025, online evaluations—real-time feedback on production traces. Monitoring dashboards: token usage, latency percentiles, error rates, cost breakdowns, feedback scores. Deployment: managed cloud (GCP), BYOC, or self-hosted.
Why This Matters in Production: When you're iterating fast on an agent, the last thing you want is manual instrumentation. LangSmith gives you visibility out of the box. For data residency, BYOC or self-hosted keeps your data where you need it.
Aha Moment: LangSmith's online evaluations mean you can run LLM-as-judge on live production traces, not just offline datasets. Catch quality drift in real time.
5. Phoenix (Arize)
Interview Insight: Phoenix is the open-source alternative with a superpower: embedding analysis. When RAG quality tanks, Phoenix shows you how queries and documents cluster—or don't. Great for retrieval-heavy apps.
Analogy: Phoenix is like an X-ray for your RAG system. You see not just "retrieval returned 5 chunks" but where those chunks live in embedding space relative to the query. Irrelevant results? Phoenix shows you why—wrong embedding model, bad chunking, or threshold too loose.
Phoenix is an open-source AI observability platform from Arize for LLM, vision, and NLP models. Fully open source—runs locally or in cloud, no vendor lock-in. Uses OpenInference tracing (standardized LLM telemetry). Accepts traces over OTLP, integrates with OpenTelemetry. First-class support for LlamaIndex, LangChain, OpenAI, Anthropic, Bedrock, Google GenAI.
Key features: LLM tracing with automatic dashboards (tool calls, errors, token usage, costs, latency), dataset evaluators (attach evaluators to datasets for server-side eval), and a Playground for interactive prompt testing. Phoenix 13.0+ (Feb 2026): Dataset Evaluators with custom providers, OpenAI Responses API support, improved Playground workflows.
The killer feature: Embedding analysis. Visualize how queries and documents cluster. If RAG returns irrelevant chunks, Phoenix shows embedding distances—is it the query embedding, document embeddings, or similarity threshold?
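A hypothetical sketch of the kind of check Phoenix automates: compare a query embedding against retrieved-chunk embeddings to see whether low similarity explains a bad retrieval. The embeddings, document names, and threshold here are made up for illustration:

```python
# Sketch: flag retrieved chunks whose embedding similarity to the query
# is suspiciously low. Embeddings and threshold are illustrative.
import math

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

query_emb = [1.0, 0.0, 0.0]
chunk_embs = {
    "refund-policy.md": [0.9, 0.1, 0.0],     # should be relevant
    "holiday-schedule.md": [0.0, 1.0, 0.0],  # off-topic
}

SIM_THRESHOLD = 0.5  # assumption: flag anything below this
flagged = [doc for doc, emb in chunk_embs.items()
           if cosine_sim(query_emb, emb) < SIM_THRESHOLD]
print(flagged)  # ['holiday-schedule.md'] — the off-topic chunk
```

Phoenix does this at scale and visually, projecting queries and documents into 2D so clusters (and outliers) are obvious.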
Why This Matters in Production: RAG failures are often retrieval failures. Phoenix lets you debug at the embedding level instead of guessing. Use it when you need open-source observability with strong retrieval focus.
Aha Moment: Retrieval quality is invisible without embedding visualization. Phoenix turns "why is our RAG bad?" into a visual answer.
6. Tracing with OpenTelemetry
Interview Insight: OTel is becoming the standard. Know the GenAI semantic conventions: gen_ai.provider.name, gen_ai.usage.input_tokens, etc. Vendor neutrality = you can switch backends without rewriting your app.
Analogy: OpenTelemetry is like USB for observability. One plug, many devices. You instrument once with standard spans; you can export to Jaeger, MLflow, Datadog, or a custom collector. Switch backends without touching application code.
OpenTelemetry defines semantic conventions for GenAI operations—standardized attribute names any backend understands. Key attributes: gen_ai.provider.name (openai, anthropic), gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.usage.cache_read.input_tokens, gen_ai.usage.cache_creation.input_tokens. A 2025 proposal extends conventions for agentic systems: tasks, subtasks, actions, agents, teams, artifacts, memory.
How it works: instrument with OTel spans (or use opentelemetry-instrument for auto-instrumentation), export to any backend. Set OTEL_EXPORTER_OTLP_ENDPOINT. For latest GenAI conventions: OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental. MLflow Tracing is fully OTel-compatible; Phoenix accepts OTLP; LangSmith supports OTel export.
Why This Matters in Production: Vendor lock-in in observability is painful. With OTel, you're future-proof. Your traces work with whatever backend you choose tomorrow.
Aha Moment: Instrument once, export anywhere. That's the promise. As GenAI conventions mature, OTel becomes the lingua franca for LLM telemetry.
7. Cost Tracking
Interview Insight: "How do you manage LLM costs?" is almost guaranteed. The weak answer: "we set a budget." The strong answer: per-request tracking, per-agent aggregation, budgets + alerts, and specific optimization levers.
Analogy: Cost tracking for LLMs is like tracking your grocery bill. You don't just look at the total at the end of the month—you know which items (models, agents) are expensive and which are cheap. You make choices: buy generic (smaller models) for some tasks, splurge (frontier models) for others.
LLM API calls cost money. Input and output tokens are priced differently, and prices vary wildly across models: as of 2025–2026, frontier models run a few dollars per million input tokens (with output priced higher), while DeepSeek V3.2 costs around $0.28/$0.42 per million input/output tokens. Same task, orders of magnitude different cost. Unmonitored agents can burn thousands of dollars in hours.
Per-request formula: (input_tokens × input_price) + (output_tokens × output_price). Include cache reads—Anthropic, for example, bills cache reads at roughly 10% of the base input-token price. Track token counts and model per call; multiply by current prices. LangSmith, Datadog, and Langfuse auto-calculate; for custom providers, send usage_metadata.
Per-agent aggregation: sum cost per agent/feature/day. Set budgets: e.g., $500/day max; alert at 50%, 80%, 100%. Optimization levers: model routing (DeepSeek vs. GPT-4o = 80–90% savings for many tasks), prompt caching (up to 90% savings for repeated context), prompt optimization (shorter = fewer tokens), batching (50% discounts for non-urgent), fewer tool calls (agents that call tools every turn waste tokens).
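The aggregation-plus-budget-alert pattern above can be sketched in a few lines. This is an illustrative sketch, not a real billing API—the record shape, agent names, budgets, and thresholds are all assumptions:

```python
# Sketch: per-agent daily cost aggregation with tiered budget alerts.
# Budgets, agent IDs, and the call-record shape are illustrative.
from collections import defaultdict

DAILY_BUDGETS = {"support-agent": 500.0, "summarizer": 50.0}
ALERT_LEVELS = (0.5, 0.8, 1.0)  # alert at 50%, 80%, 100% of budget

def aggregate_costs(call_records):
    """Sum cost per agent from per-call records like {"agent_id", "cost"}."""
    totals = defaultdict(float)
    for rec in call_records:
        totals[rec["agent_id"]] += rec["cost"]
    return dict(totals)

def budget_alerts(totals):
    """Return (agent, percent) for every budget threshold crossed."""
    alerts = []
    for agent, spent in totals.items():
        budget = DAILY_BUDGETS.get(agent)
        if budget is None:
            continue  # no budget configured for this agent
        for level in ALERT_LEVELS:
            if spent >= budget * level:
                alerts.append((agent, int(level * 100)))
    return alerts

calls = [
    {"agent_id": "support-agent", "cost": 180.0},
    {"agent_id": "support-agent", "cost": 90.0},
    {"agent_id": "summarizer", "cost": 12.0},
]
totals = aggregate_costs(calls)
print(totals)                 # {'support-agent': 270.0, 'summarizer': 12.0}
print(budget_alerts(totals))  # [('support-agent', 50)]
```

In production the same logic would run as a scheduled job over a cost table and page on the 80% and 100% tiers.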
Why This Matters in Production: Cost is a leading indicator of problems. A sudden spike often means a loop, a bug, or runaway agent. Budget alerts catch it before the bill does.
Aha Moment: Cost optimization is continuous, not one-time. Prices change, usage patterns shift, new models emerge. Review monthly.
8. Prompt Versioning
Interview Insight: "How do you roll back a bad prompt?" If you say "redeploy," you're slow. If you say "flip a config in the registry," you're ready for production.
Analogy: Prompt versioning is like Git for your system prompts. You need to know which version produced each response. When quality tanks, you check: was it prompt v3? Roll back to v2. No code deploy.
Track which prompt version produced each response. Link versions to quality metrics: did v2 improve faithfulness over v1? Store prompts in a registry (database, Git, Langfuse). Tag each trace with prompt_version or prompt_id. Maintain a mapping from version to content. Rollback = switch production config to previous version.
Version prompts alongside code in CI—run evals on each prompt change, block merges that regress quality. Rollback capability is critical: when a new prompt degrades production, revert in seconds, not hours.
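A minimal in-memory sketch of the registry-plus-rollback idea: versioned prompts with a production pointer, so rollback is a pointer flip rather than a redeploy. Class and method names are illustrative; a real registry would back this with a database or a tool like Langfuse:

```python
# Sketch: a prompt registry where rollback = flipping the active version.
# All names are illustrative; persistence is omitted for brevity.
class PromptRegistry:
    def __init__(self):
        self._versions = {}  # (name, version) -> prompt text
        self._active = {}    # name -> currently active version

    def register(self, name, version, text):
        self._versions[(name, version)] = text

    def set_active(self, name, version):
        if (name, version) not in self._versions:
            raise KeyError(f"unknown version {version} for {name}")
        self._active[name] = version

    def get_active(self, name):
        version = self._active[name]
        return version, self._versions[(name, version)]

reg = PromptRegistry()
reg.register("support", "v1", "You are a support agent. Cite sources.")
reg.register("support", "v2", "You are a support agent. Cite sources. Refuse refunds over $100.")
reg.set_active("support", "v2")
# Quality tanks after the v2 deploy: rollback is one call, no redeploy.
reg.set_active("support", "v1")
print(reg.get_active("support")[0])  # v1
```

Tagging every trace with the active version at call time is what makes "which prompt produced this response?" answerable later.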
Why This Matters in Production: The most common "incident" is a bad prompt deploy. With a registry and version tags, rollback is a config flip. Without it, you're digging through Git history and redeploying.
Aha Moment: Prompts are configuration that behaves like code. Treat them with the same discipline: version, test, gate merges.
9. Latency and Performance Dashboards
Interview Insight: Know the difference: P50 vs. P95 vs. P99. Know that for chat, first-token latency matters more than total for perceived responsiveness. Define SLAs per use case.
Analogy: Latency monitoring is like race car telemetry. You need the lap time (total) but also the split times (retrieval, LLM, tools). The slow segment tells you where to optimize. And for interactive chat, the "green light" moment is the first token—users tolerate a longer total if text appears quickly.
Track P50, P95, P99. Measure total and per-step breakdown. Is the slow part retrieval, LLM, or tool call? If retrieval is 2s and LLM is 1s, optimize retrieval first. For streaming: time to first token (TTFT) vs. time to last token (TTLT). Chat interfaces care about TTFT—perceived responsiveness. Batch jobs care about TTLT.
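Measuring TTFT and TTLT can be done by timing around a streaming response. A minimal sketch, where `fake_stream` is a stand-in for a provider's streaming generator (not a real SDK call):

```python
# Sketch: measure time-to-first-token (TTFT) and time-to-last-token (TTLT)
# around any streaming generator. `fake_stream` simulates a provider stream.
import time

def fake_stream():
    for token in ["The", " refund", " window", " is", " 30", " days", "."]:
        time.sleep(0.01)  # simulated per-token latency
        yield token

def measure_stream(stream):
    start = time.perf_counter()
    ttft = None
    tokens = []
    for token in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # first token arrived
        tokens.append(token)
    ttlt = time.perf_counter() - start          # last token arrived
    return "".join(tokens), ttft, ttlt

text, ttft, ttlt = measure_stream(fake_stream())
print(f"TTFT={ttft * 1000:.0f}ms TTLT={ttlt * 1000:.0f}ms")
```

The same wrapper works around a real streaming API; for chat UIs, alert on TTFT, and for batch pipelines, alert on TTLT.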
Dashboards: quality metrics over time (faithfulness, relevancy), cost trends, latency trends, error rates, user feedback. Segment by model, prompt version, agent, feature. Drift detection: quality declining over time can mean provider model update, data distribution shift, or prompt drift. Set thresholds: e.g., faithfulness < 0.8 for 3 hours → alert. Investigate promptly.
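The "faithfulness below threshold for N hours" alert above is a threshold-over-window check. A minimal sketch, with illustrative scores and a three-sample window standing in for three hourly averages:

```python
# Sketch: fire a drift alert when a quality metric stays below a threshold
# for N consecutive samples (e.g. hourly faithfulness averages).
from collections import deque

class DriftAlert:
    def __init__(self, threshold=0.8, window=3):
        self.threshold = threshold
        self.recent = deque(maxlen=window)

    def record(self, score: float) -> bool:
        """Record a sample; return True when the alert should fire."""
        self.recent.append(score)
        full = len(self.recent) == self.recent.maxlen
        return full and all(s < self.threshold for s in self.recent)

alert = DriftAlert(threshold=0.8, window=3)
hourly_faithfulness = [0.91, 0.88, 0.76, 0.74, 0.72]
fired_at = [i for i, s in enumerate(hourly_faithfulness) if alert.record(s)]
print(fired_at)  # [4] — three consecutive sub-0.8 readings by hour 4
```

Requiring consecutive low readings avoids paging on a single noisy sample while still catching sustained degradation.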
Why This Matters in Production: You can't optimize what you don't measure. Per-step latency tells you where to invest. Drift detection catches model/provider changes before users complain.
Aha Moment: Define SLAs per use case. Interactive chat: P95 < 3s. Batch processing: P95 < 30s. One size doesn't fit all.
10. Incident Response for AI Systems
Interview Insight: "Walk me through an AI incident you handled." Structure: detect → diagnose → respond → post-mortem. Traces are essential. Have runbooks. Blameless post-mortems.
Analogy: Incident response for AI is like ER triage. You need detection (who's coming in?), diagnosis (what's wrong?), response (stop the bleed), and post-mortem (how do we prevent this?). The difference: AI incidents are often quality failures, not availability. Your system is "up"—it's just wrong.
Detection: Automated quality monitoring (scores, error rates), user complaints, error spikes. Have alerting before incidents.
Diagnosis: Was it the model (provider update?), data (bad retrieval, stale index?), prompt (recent change?), or tool (external API down?)? Traces are essential—inspect the failing request's trace. Check recent deployments.
Response: Rollback prompt or model, disable the problematic agent, switch to fallback. Have runbooks: "If quality drops >15%, roll back to prompt v2." Communicate to stakeholders.
Post-mortem: Add failing case to golden dataset, improve evals to catch similar cases, update guardrails if safety-related. Document and share. Goal: prevent recurrence.
Why This Matters in Production: Incidents will happen. Without a process, you're scrambling. With runbooks and trace-based diagnosis, you fix fast and learn.
Aha Moment: The best incident response is preparation. Runbooks, rollback capability, golden datasets—built before the first incident.
Architecture Diagrams
LLMOps Pillars
flowchart TB
subgraph Observability
trace[Tracing]
metric[Metrics]
log[Logs]
trace --> dash[Dashboards]
metric --> dash
log --> dash
end
subgraph Evaluation
golden[Golden Dataset]
eval[Eval Suite]
regress[Regression Gates]
golden --> eval
eval --> regress
end
subgraph Deployment
promptReg[Prompt Registry]
canary[Canary Release]
rollback[Rollback]
promptReg --> canary
canary --> rollback
end
subgraph CostMgmt[Cost Management]
token[Token Tracking]
budget[Budget Alerts]
opt[Optimization]
token --> budget
budget --> opt
end
app[LLM Application] --> Observability
app --> Evaluation
app --> Deployment
app --> CostMgmt
Request Trace Flow
flowchart LR
req[Incoming Request] --> instr[Instrumented App]
instr --> span1[Span: Retrieval]
instr --> span2[Span: LLM Call]
instr --> span3[Span: Tool Call]
span1 --> span2
span2 --> span3
span1 --> otlp[OTLP Export]
span2 --> otlp
span3 --> otlp
otlp --> backend[Jaeger / MLflow / Datadog]
backend --> dash[Dashboard]
Incident Response Flow
flowchart TB
detect[Detection: Alert or User Report] --> diag[Diagnosis]
diag --> q1{Model Update?}
q1 -->|Yes| r1[Check Provider Changelog]
q1 -->|No| q2{Recent Prompt Change?}
q2 -->|Yes| r2[Rollback Prompt]
q2 -->|No| q3{Retrieval Issue?}
q3 -->|Yes| r3[Check Index or Embeddings]
q3 -->|No| q4{Tool or API Down?}
q4 -->|Yes| r4[Switch Fallback or Disable Tool]
q4 -->|No| r5[Deep Dive: Trace Analysis]
r1 --> fix[Fix or Rollback]
r2 --> fix
r3 --> fix
r4 --> fix
r5 --> fix
fix --> post[Post-Mortem: Golden Set, Runbook Update]
Cost Tracking Pipeline
flowchart LR
calls[LLM Calls] --> tokens[Token Counts]
tokens --> pricing[Pricing Table]
pricing --> costPerReq[Cost per Request]
costPerReq --> agg[Per-Agent Aggregation]
agg --> dashCost[Cost Dashboard]
agg --> alerts[Budget Alerts]
Tool Comparison
flowchart TB
subgraph mlflow[MLflow]
openSource[Open Source]
fullLifecycle[Full Lifecycle]
otelCompat[OTel Compatible]
end
subgraph langsmith[LangSmith]
langchainNative[LangChain Native]
onlineEval[Online Evaluations]
zeroConfig[Zero-Config Tracing]
end
subgraph phoenix[Phoenix]
openSource2[Open Source]
embeddings[Embedding Analysis]
retrieval[Retrieval Focus]
end
choice{Choose By} --> mlflow
choice --> langsmith
choice --> phoenix
Code Examples
MLflow Tracking Setup
import mlflow
from openai import OpenAI
mlflow.openai.autolog()
mlflow.set_experiment("RAG Pipeline")
with mlflow.start_run(run_name="prod-v2"):
    mlflow.log_param("model", "gpt-4o")
    mlflow.log_param("temperature", 0.0)
    mlflow.log_param("prompt_version", "v2.1")
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "What is the refund policy?"}],
        temperature=0.0,
    )
    usage = response.usage
    mlflow.log_metric("input_tokens", usage.prompt_tokens)
    mlflow.log_metric("output_tokens", usage.completion_tokens)
    mlflow.log_metric("total_tokens", usage.total_tokens)
LangSmith Tracing (Zero Config)
import os
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-api-key"
os.environ["LANGCHAIN_PROJECT"] = "my-agent"
llm = ChatOpenAI(model="gpt-4o", temperature=0)
response = llm.invoke([HumanMessage(content="Summarize the key points.")])
# Trace appears in LangSmith with full span hierarchy
LangSmith Custom Cost (usage_metadata)
from langsmith import traceable, get_current_run_tree
@traceable(run_type="llm", metadata={"ls_provider": "openai", "ls_model_name": "gpt-4o"})
def chat_model(messages: list):
    # Stubbed response standing in for a real provider call
    response = {"choices": [{"message": {"content": "Sure, what time?"}}]}
    token_usage = {
        "input_tokens": 27,
        "output_tokens": 13,
        "total_tokens": 40,
        "input_token_details": {"cache_read": 10},
    }
    run = get_current_run_tree()
    run.set(usage_metadata=token_usage)
    return response

@traceable(run_type="tool", name="get_weather")
def get_weather(city: str):
    result = {"temperature_f": 68, "condition": "sunny"}
    run = get_current_run_tree()
    run.set(usage_metadata={"total_cost": 0.0015})
    return result
Cost Calculation
# Prices in USD per 1M tokens (keep this table in sync with provider pricing)
PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "claude-3-5-sonnet": {"input": 3.00, "output": 15.00},
}

def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    if model not in PRICING:
        return 0.0  # unknown model: no price data, report zero rather than crash
    prices = PRICING[model]
    input_cost = (input_tokens / 1_000_000) * prices["input"]
    output_cost = (output_tokens / 1_000_000) * prices["output"]
    return input_cost + output_cost

cost = calculate_cost("gpt-4o", response.usage.prompt_tokens, response.usage.completion_tokens)
Latency Breakdown
import time
from contextlib import contextmanager
@contextmanager
def timed_span(name: str, tracer=None):
    start = time.perf_counter()
    try:
        yield
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        if tracer:
            tracer.record_span(name, duration_ms)

# Usage — retriever, llm, tracer, prompt, query are whatever your app defines:
with timed_span("retrieval", tracer):
    chunks = retriever.invoke(query)
with timed_span("llm_generation", tracer):
    response = llm.invoke([SystemMessage(content=prompt), HumanMessage(content=query)])
OpenTelemetry GenAI Span
from opentelemetry import trace
tracer = trace.get_tracer("my-llm-app", "1.0.0")
with tracer.start_as_current_span("openai.chat") as span:
    span.set_attribute("gen_ai.provider.name", "openai")
    span.set_attribute("gen_ai.request.model", "gpt-4o")
    response = client.chat.completions.create(...)
    span.set_attribute("gen_ai.usage.input_tokens", response.usage.prompt_tokens)
    span.set_attribute("gen_ai.usage.output_tokens", response.usage.completion_tokens)
Conversational Interview Q&A
1. "How do you trace a request through a multi-agent system?"
Weak answer: "We use logs and maybe some custom instrumentation to see what each agent does."
Strong answer: "Every step—every agent, LLM call, tool call, retrieval—creates a span with proper parent-child relationships. The root span is the incoming request; each agent is a child; within an agent, retrieval, LLM, and tools are children of that agent span. At Maersk, we used MLflow Tracing with our Enterprise AI Agent Platform—context propagation happens automatically so when Agent A hands off to Agent B, the trace continues. We can answer: which agent was slow? Did a tool fail? Where did the chain break? For a user-reported bad response, we find the trace by request ID and walk the span tree. We also aggregate metrics: latency and cost per agent, which drove optimization—we simplified the graph when we saw Agent B was expensive and rarely added value."
2. "Walk me through your cost optimization strategy."
Weak answer: "We set budgets and try to use cheaper models when we can."
Strong answer: "First, we measure. We track token usage and cost per request, per agent, per feature. At Maersk, each agent on the platform had a unique ID; every LLM call logged agent_id, model, tokens, timestamp. A nightly job aggregated cost per agent per day. We set budgets—e.g., $100/day for support, $50 for summarization—and alerts at 50%, 80%, 100%. The biggest lever was model routing: simple tasks (classification, extraction) went to GPT-4o-mini or similar; complex reasoning used frontier models. Our email booking agent was expensive—multiple LLM calls per email (parse, propose, confirm). We combined steps and used a smaller model for parsing. Second, prompt optimization—shorter prompts, fewer tokens. Third, caching for repeated RAG context. Fourth, batching for non-urgent jobs. We review monthly; cost optimization is continuous."
3. "How do you handle model degradation in production?"
Weak answer: "We monitor error rates and if something looks wrong we investigate."
Strong answer: "Degradation can come from provider updates, distribution shift, or our own changes. We detect via quality metrics—faithfulness, relevancy, user feedback—and LLM-as-judge on production samples. At Maersk, we had alerts when faithfulness dropped below a threshold for a few hours. Diagnosis: we compare recent changes—new prompt? Model switch? Retriever update? We run the offline eval suite; if it passes offline but fails in prod, it's likely distribution shift. We inspect failing traces: what do bad responses have in common? Response: roll back prompt, switch model, or fix retriever. We add failing cases to the golden dataset. We had one incident where the provider updated the model and our prompt stopped working—we rewrote the prompt and added that case to evals. Post-mortem: document, update runbooks, share learnings."
4. "What monitoring do you set up before going live?"
Weak answer: "We make sure we have some logging and maybe error tracking."
Strong answer: "Full observability before launch. Tracing: every LLM call, retrieval, tool call produces a span—we used MLflow with our agent platform. Metrics: P50/P95/P99 latency for full request and per step, token usage, cost per request, quality scores, error rate. Alerts: latency P95 > 5s, error rate > 1%, quality below threshold, cost at 80% of budget. We set up a synthetic check—test request every few minutes. Rollback capability: prompt registry, model fallback, runbooks. At Maersk, we ran canaries—5% traffic to new agent, monitor 24–48 hours, then ramp. Going live without this is risky; you'll have incidents, and without monitoring you won't know until users complain."
5. "How do you decide: prompt problem, retrieval problem, or model problem?"
Weak answer: "We look at the output and try to figure it out."
Strong answer: "First, the trace. What did we retrieve? What did we send to the LLM? What did it return? If retrieved context is irrelevant or missing key info, it's retrieval—the LLM can only answer from what it receives. Fix: improve retriever, expand index, or query rewriting. If context is good but output is wrong, it's prompt or model. Does the same prompt work for similar inputs? If yes, model non-determinism or edge case. If consistently bad for a class of inputs, it's prompt—unclear instructions, wrong format. We A/B test prompt variants. If we recently switched models and quality dropped globally, it's a model problem—adapt the prompt or revert. We also run component-level evals: retrieval alone (context precision, recall), generator with fixed good context. That decomposition drives targeted fixes."
6. "Compare MLflow, LangSmith, and Phoenix. When would you choose each?"
Weak answer: "They're all good. MLflow is for ML, LangSmith is for LangChain, Phoenix is another option."
Strong answer: "MLflow: open source, full lifecycle, runs on your infra. MLflow Tracing is OTel-compatible, 20+ GenAI integrations, mlflow-tracing for production. Use when you want self-hosting and no vendor lock-in—we used it at Maersk for the Enterprise AI Agent Platform. LangSmith: LangChain's platform, zero-config tracing with env vars, datasets, online evaluations. Use when you're on LangChain/LangGraph and want the smoothest integration. Phoenix: open source from Arize, excels at embedding analysis and retrieval quality—visualize how queries and docs cluster. Use when RAG is critical and you need to debug at the embedding level. Summary: MLflow for full lifecycle and self-hosting; LangSmith for LangChain; Phoenix for retrieval-focused open source. Many teams combine them."
7. "Describe your incident response process for AI systems."
Weak answer: "We get alerted and then we try to fix it."
Strong answer: "Triage: assess severity—all users or subset? Quality degraded or system down? Page on-call, create incident channel. Mitigate: stop the bleed. At Maersk we had a runbook: if quality dropped, check recent deployments. New prompt in last 24 hours? Roll back immediately—prompt registry means rollback is a config flip. Model provider update? Switch to fallback. Tool failing? Disable or circuit breaker. Communicate: 'We're investigating; we've rolled back X.' Diagnose: pull traces for failing requests, compare before/after, run offline evals, check provider changelogs. Resolve: fix and deploy with canary, add failing cases to golden set. Post-mortem: document root cause, update runbooks, improve monitoring. Blameless. Goal: prevent recurrence and get faster next time."
From Your Experience
"You integrated MLflow for observability at Maersk. What did you track?"
We integrated MLflow Tracing for our RAG-based agents on the Enterprise AI Agent Platform. We tracked every step: user query, retrieval (query, top-k chunks, similarity scores), LLM call (full prompt, completion, token counts, latency), and final response. We used mlflow.openai.autolog() for OpenAI and manual spans for custom retrieval and orchestration. We logged parameters per run: model, temperature, prompt version, retriever config. Metrics: input_tokens, output_tokens, retrieval_latency_ms, llm_latency_ms, total_latency_ms, and cost per request. We ran evals (faithfulness, answer relevancy) on trace samples. In production we used mlflow-tracing for lightweight footprint and async logging.
"How did cost tracking work per agent on the platform?"
Each agent had a unique ID. Every LLM call logged agent_id, model, input_tokens, output_tokens, timestamp. A nightly job aggregated cost per agent per day. Dashboard showed daily spend per agent. Budgets: e.g., $100 for support, $50 for summarization. Alerts at 50%, 80%, 100%. At 80% we'd investigate—sometimes expected spike, sometimes a bug (e.g., agent stuck in a loop). We found the email booking agent was expensive due to multiple LLM calls per email; we optimized by combining steps and using a smaller model for parsing. Cost tracking drove both budgeting and optimization.
"What was your incident response process?"
Runbook-based. When quality dropped (faithfulness or thumbs-up rate below threshold): (1) Check recent deployments—new prompt or model in 24h? (2) If yes, roll back. Prompt registry meant config flip. (3) If no, pull traces for failing requests. Bad retrieval? Hallucination? Tool failure? (4) Fallback: switch to simpler flow (e.g., FAQ matching) to reduce blast radius. (5) Post-incident: add failing cases to golden set, run evals, update runbook. Blameless post-mortem. One incident: provider updated the model, our prompt stopped working well. We rewrote the prompt, added explicit instructions, put that case in the golden set.
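A toy illustration of why registry-backed rollback is just a config flip — the in-memory registry, prompt name, and versions here are hypothetical stand-ins for a real prompt registry, which persists versions and the active alias server-side:

```python
# Hypothetical registry: versioned prompt texts per prompt name.
REGISTRY = {
    "support-answer": {
        "v11": "You are a support assistant. Answer using only the provided context.",
        "v12": "You are a support assistant. Answer concisely.",
    }
}
# Deploy config: which version serves traffic. Changing this is the "flip".
ACTIVE = {"support-answer": "v12"}

def get_prompt(name: str) -> str:
    """What the serving path reads on every request."""
    return REGISTRY[name][ACTIVE[name]]

def rollback(name: str, version: str) -> None:
    """Rollback is a config change — no code deploy, no image rebuild."""
    if version not in REGISTRY[name]:
        raise KeyError(f"{name} has no version {version}")
    ACTIVE[name] = version

rollback("support-answer", "v11")  # incident response step (2): flip back
print(get_prompt("support-answer"))
```

Because the serving path resolves the prompt through the active pointer at request time, flipping the pointer takes effect immediately — which is what makes "roll back in seconds" realistic.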
Quick Fire Round
- What are the three pillars of LLM observability? Traces, metrics, logs.
- Why isn't HTTP 200 enough for LLM apps? It doesn't tell you whether the answer was correct—hallucination, bad retrieval, and harmful output can all return 200.
- What's the MLflow lightweight package for production? `mlflow-tracing`—95% smaller footprint, async logging.
- How do you enable LangSmith tracing with zero code changes? Set `LANGCHAIN_TRACING_V2=true` and an API key.
- Phoenix's killer feature for RAG? Embedding analysis—visualize how queries and documents cluster.
- Key OTel GenAI attributes? `gen_ai.provider.name`, `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`.
- Cost formula per request? (input_tokens × input_price) + (output_tokens × output_price).
- Biggest cost optimization lever? Model routing—smaller models for simple tasks, 80–90% savings possible.
- What enables prompt rollback without code deploy? A prompt registry with version tags; rollback = config flip.
- For chat, which latency matters more for perceived responsiveness? Time to first token (TTFT).
- What's drift detection? Quality metrics declining over time—can indicate a provider update, distribution shift, or prompt drift.
- First step when quality drops in production? Check recent deployments; if a new prompt shipped, roll back immediately.
- What do you add to the golden dataset after an incident? The failing case(s), so future evals catch similar issues.
- MLflow vs. LangSmith vs. Phoenix in one sentence each? MLflow: full lifecycle, self-hosted. LangSmith: LangChain-native, zero config. Phoenix: open source, retrieval/embedding focus.
- What's `usage_metadata` in LangSmith? Token usage and cost data for custom or non-standard models—LangSmith computes cost from model pricing.
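The drift-detection item above can be sketched as a rolling-window check on a per-request quality score (e.g. an LLM-judged faithfulness score in [0, 1]); the window size and threshold are illustrative:

```python
from collections import deque

def make_drift_detector(window: int = 50, threshold: float = 0.85):
    """Return an observer that alerts when the rolling mean of a quality
    score drops below the threshold."""
    scores = deque(maxlen=window)
    def observe(score: float) -> bool:
        scores.append(score)
        # Wait until the window is full to avoid noisy cold-start alerts.
        return len(scores) == window and sum(scores) / window < threshold
    return observe

observe = make_drift_detector(window=5, threshold=0.85)
for s in [0.95, 0.93, 0.94, 0.92, 0.96]:   # healthy traffic: no alerts
    assert not observe(s)
fired = [observe(s) for s in [0.70, 0.72, 0.68, 0.71, 0.69]]  # quality drop
print(any(fired))  # True — the rolling mean crossed below 0.85
```

A rolling mean smooths single bad requests into a trend, so the alert fires on sustained degradation (provider update, distribution shift) rather than on one outlier.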
Key Takeaways
| Topic | Takeaway |
|---|---|
| LLMOps scope | The "model" is the full system—prompt, retrieval, orchestration, tools—not just the base LLM. |
| Observability | Traditional monitoring (HTTP 200) is insufficient. Observe quality at every step: traces, metrics, logs. |
| MLflow | Open source, OTel-compatible, 20+ integrations. Use mlflow-tracing in production for lightweight footprint. |
| LangSmith | Zero-config for LangChain/LangGraph. Online evaluations, cost tracking, datasets. |
| Phoenix | Open source, embedding analysis, retrieval quality. Best for RAG debugging. |
| OpenTelemetry | Vendor-neutral; instrument once, export anywhere. GenAI semantic conventions maturing. |
| Cost | Track per request, per agent. Budgets + alerts. Optimize: model routing, caching, prompt compression, fewer tool calls. |
| Prompt versioning | Registry + version tags. Rollback = config flip. Version prompts in CI with eval gates. |
| Latency | P50/P95/P99 total and per step. For chat, TTFT matters. Define SLAs per use case. |
| Incident response | Detect → diagnose (traces!) → respond (rollback, fallback) → post-mortem (golden set, runbook). |
Further Reading
- MLflow Tracing (mlflow.org/docs/latest/llms/tracing): Autolog for 20+ GenAI libraries, `mlflow-tracing` production SDK, OTel export, human feedback, dataset collection.
- LangSmith (docs.langchain.com/langsmith): Tracing, `usage_metadata`, datasets, online evaluations (LLM-as-judge), monitoring, alerting.
- Arize Phoenix (docs.arize.com/phoenix): OpenInference/OTLP tracing, metrics, embedding analysis, retrieval quality, Dataset Evaluators (v13+).
- OpenTelemetry GenAI Conventions (opentelemetry.io/docs/specs/semconv/gen-ai): `gen_ai.*` attributes, agent spans, provider-specific conventions.
- OpenInference (arize-ai.github.io/openinference/spec): Standardized LLM telemetry format.
- LLM API Pricing 2025 (IntuitionLabs, Helicone): Model pricing comparison, cost calculators.
- OWASP Top 10 for LLM Applications: Security for production LLM systems.