Session 16: System Design & Behavioral Prep
System design interviews are where you prove you can actually build things. Not toy projects—real systems with real users, real failure modes, and real cost constraints. The difference between a senior engineer and a mid-level one often comes down to this: can you hold an entire architecture in your head, reason about trade-offs under pressure, and articulate why you'd choose one path over another?
Behavioral interviews are the flip side. They're asking: "Will this person make good decisions when we're not watching? Can they recover from failure? Will they actually collaborate?" Your technical chops get you in the room; your stories seal the deal.
This session is your capstone. We'll walk through how to crush a 45-minute AI system design interview, nail the behavioral questions with STAR stories grounded in your Maersk experience, and give you the quick-reference material you need right before you walk in. Let's go.
1. Requirements Clarification
You're about to architect an AI system. Your first instinct might be to grab a whiteboard and start drawing boxes. Resist it. The fastest way to lose an interviewer is to build the wrong thing really well.
Interview Insight: "Before I sketch the architecture, I'd like to clarify a few things—especially around scale, latency, and data privacy" is a sentence that separates senior thinking from implementer thinking. It signals you think like a system owner.
Users and scale: Who uses this? Internal teams, external customers, both? How many daily active users (DAU)? What requests per second (RPS) should we expect? Regional or global? These questions bound your entire design—a system for 100 internal users looks nothing like one for millions of customers.
Quality requirements: What accuracy are we aiming for (e.g., 95% extraction)? What's the latency SLA (P95 under 2 seconds for chat, 500ms for autocomplete)? Can we tolerate some factual errors, or is this legal/financial where mistakes are catastrophic?
Constraints: Budget—cost per request, monthly cap? Security—can data leave our premises? Regulatory—GDPR, HIPAA, SOC2? Data freshness—how often does the knowledge base update?
LLM-specific questions: Context size (long docs, multi-turn)? Do we need RAG and from what sources? Structured output or free-form? What feedback signals exist (thumbs up/down, corrections, human review)?
Think of it like ordering at a restaurant. You don't just say "I want food." You ask: dietary restrictions? Budget? How hungry? The waiter (interviewer) has answers—your job is to ask until the picture is clear.
Why This Matters in Production: At Maersk, the email booking system had to handle multilingual customer emails, port codes, commodity codes, and human review for low-confidence extractions. If we'd skipped clarification, we might have built a single-language, auto-only system that would have failed on day one.
Aha Moment: The interviewer isn't testing whether you know all the questions—they're testing whether you ask them. Silence is a red flag. Curiosity is a green one.
2. High-Level Architecture
Now you draw. The goal: main components, data flow, and key design decisions. Not implementation details—architecture.
Interview Insight: A clean diagram you can point to throughout the interview is worth more than a thousand words. Draw it once, refer to it often.
Components to include:
- API gateway: Auth, rate limiting, telemetry
- Orchestration layer: Coordinates prompts, retrieval, model selection, tool calls
- Retriever (if RAG): Embedding, vector search, reranking
- Model layer: LLM gateway, model router, fallback logic
- Post-processor: Output validation, guardrails, formatting
- Storage: Vector store, cache, audit logs
Key design decisions to articulate:
- Model selection: GPT-4o for complex reasoning, GPT-4o-mini for simple queries? Why?
- RAG vs. fine-tuning: When do you retrieve vs. bake knowledge into the model?
- Agent vs. single-call: When does an agentic loop beat a single LLM invocation?
- Stateless vs. stateful: Session memory, conversation history, tool state—how do you manage it?
Trace a request from user input through each component to the final response. Show where tokens flow, where retrieval happens, and where costs accumulate. Imagine you're walking a new hire through the system on their first day—if you can't explain the flow in one breath, you haven't nailed it yet.
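The trace above can be sketched as a minimal orchestration function. Every name here (`handle_request`, `retrieve`, `call_llm`, and so on) is an illustrative stand-in, not a real framework API:

```python
# Minimal sketch of a request flowing through the components above.
# All functions are stubs standing in for real subsystems.

def authenticate(request: dict) -> bool:
    # API gateway: auth + rate limiting (stubbed)
    return request.get("user") is not None

def retrieve(query: str) -> list[str]:
    # Retriever: embed query, vector search, rerank (stubbed)
    return ["chunk about port codes", "chunk about commodity codes"]

def call_llm(prompt: str) -> str:
    # Model layer: model routing + fallback logic (stubbed)
    return f"answer based on: {prompt[:40]}..."

def validate_output(text: str) -> bool:
    # Post-processor: guardrails and format checks (stubbed)
    return bool(text.strip())

def handle_request(request: dict) -> dict:
    if not authenticate(request):
        return {"error": "unauthorized"}
    chunks = retrieve(request["query"])                 # retrieval happens here
    prompt = f"Context: {chunks}\nQuestion: {request['query']}"
    answer = call_llm(prompt)                           # tokens (and cost) accumulate here
    if not validate_output(answer):
        return {"error": "guardrail_violation"}
    return {"answer": answer}
```

If you can narrate this flow in the interview as fluently as you can read it here, the "one breath" test is passed.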
Why This Matters in Production: Our Enterprise AI Agent Platform at Maersk has a centralized model router, guardrail engine, and tool registry. The architecture isn't accidental—every component exists because we learned the hard way that scattergun approaches (each team with their own LLM access, ad-hoc guardrails) create chaos.
Aha Moment: "I chose X over Y because..." is the refrain of a strong candidate. Justify your choices. The interviewer wants to see your reasoning, not just a diagram they could find on a blog.
3. Deep Dive on Critical Components
The interviewer will probe. They want to see if you have depth, not just breadth. Pick 1–2 subsystems and go deep. Here's what they might ask about.
RAG systems: Chunking strategy (fixed-size vs. semantic, overlap), embedding model choice, vector store (Pinecone, Weaviate, pgvector?), hybrid search (BM25 + vector), reranking, context assembly (how many chunks, token budget). For email booking at Maersk, we use semantic chunks for port/commodity descriptions and smaller chunks for codes—exact match matters.
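A minimal sketch of fixed-size chunking with overlap, one of the strategies named above (character-based for brevity; production pipelines usually count tokens instead):

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Fixed-size chunking with overlap: consecutive chunks share `overlap`
    characters so sentences straddling a boundary survive in at least one chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks
```

The `overlap` knob is the trade-off to articulate: larger overlap means better recall at chunk boundaries but more stored (and retrieved) tokens.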
Agent systems: Orchestration pattern (supervisor vs. sequential vs. routing), tool design (single-responsibility, parameter validation), state management, error handling (retries, fallbacks, human handoff, circuit breakers). We learned the hard way: start simple. Narrow tool scope. Add complexity only when needed.
Platforms: Multi-tenancy (tenant isolation, quotas, cost allocation), guardrails (input/output validation, content policies, PII), evaluation (online vs. offline evals, golden datasets, A/B testing), observability (tracing, metrics, dashboards, alerting). The Enterprise AI Agent Platform centralizes all of this so no agent can bypass safety checks.
Think of it like a car engine. Anyone can draw a rectangle labeled "engine." The senior engineer knows when to talk about fuel injection vs. turbocharging—and when each choice bites you.
Interview Insight: "We considered X but chose Y because of Z" shows you've thought through alternatives. Listing components without trade-offs reads like a textbook.
Why This Matters in Production: A RAG system without proper chunking will hallucinate port codes. An agent without guardrails will promise refunds it can't deliver. A platform without observability will burn money without anyone noticing. Depth in these areas prevents disasters.
Aha Moment: The interviewer isn't trying to stump you. They're trying to find the edge of your knowledge. "I haven't worked with that specifically, but here's how I'd approach it..." is a valid move. Honesty beats bluffing.
4. Production Concerns
Happy path design is table stakes. What separates senior engineers is thinking beyond it: monitoring, scaling, failure handling, cost management, security.
Monitoring: Token usage per request, latency (P50, P95, P99), error rate, retrieval hit rate. Dashboards for cost by model/feature, extraction accuracy, human review rate. Alerts for latency spikes, error rate, cost anomalies, guardrail violations.
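For intuition, P50/P95/P99 can be computed from raw latency samples with the nearest-rank method (real monitoring stacks use streaming estimators, but the definition is the same):

```python
import math

def percentile(latencies_ms: list[float], p: float) -> float:
    """Nearest-rank percentile, e.g. p=0.95 for P95."""
    if not latencies_ms:
        raise ValueError("no samples")
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(p * len(ordered)))  # 1-indexed rank
    return ordered[rank - 1]
```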
Scaling: Stateless API servers, async processing, semantic cache for repeated queries, prompt caching for static content. Model routing: 70% to cheap model, 20% mid-tier, 10% premium. At Maersk, we route FAQ to GPT-4o-mini and complex extraction to GPT-4o—saves real money.
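A hedged sketch of complexity-based routing; the length-plus-attachments heuristic and the exact thresholds are placeholders, and real routers often use a small classifier instead:

```python
def route_model(query: str, has_attachments: bool = False) -> str:
    """Route simple queries to a cheap model, complex ones to a premium model.
    The heuristic below is a placeholder, not the actual routing logic."""
    if has_attachments or len(query.split()) > 150:
        return "gpt-4o"       # premium: complex extraction, long context
    return "gpt-4o-mini"      # cheap: FAQ-style queries
```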
Failure handling: Circuit breakers when a model or service fails. Fallbacks to simpler models or cached responses. Graceful degradation: partial results, retry prompts, human handoff. Never let the system go dark silently.
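The circuit-breaker-plus-fallback pattern can be sketched as a toy in-process version (production systems would use a hardened library, but the state machine is the same):

```python
import time

class CircuitBreaker:
    """After `max_failures` consecutive failures, short-circuit to the
    fallback for `reset_after` seconds, then try the primary again."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, primary, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()          # circuit open: degrade gracefully
            self.opened_at = None          # half-open: probe the primary again
            self.failures = 0
        try:
            result = primary()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback()              # never go dark silently
```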
Cost management: Token budgets per request, per user, per tenant. Caching to reduce redundant LLM calls. We track cost per booking in the email automation system—it's a first-class metric.
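A minimal budget gate illustrating the idea; the per-1K-token price and the cap are placeholder numbers:

```python
def within_budget(spent_usd: float, request_tokens: int,
                  price_per_1k_usd: float, cap_usd: float) -> bool:
    """Return True if this request fits the remaining budget
    (per user, per tenant, or per feature, depending on the key)."""
    cost = request_tokens / 1000 * price_per_1k_usd
    return spent_usd + cost <= cap_usd
```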
Security: Guardrails (prompt injection defense, output validation, PII masking), access control (per-tenant, per-user, per-feature), audit logging for compliance. Enterprise means someone will ask "where does the data go?" You need an answer.
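Regex-based PII masking for logs, as a sketch only; real deployments use dedicated PII detection services rather than hand-rolled patterns:

```python
import re

# Illustrative patterns, not production-grade PII detection.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{7,}\d")

def mask_pii(text: str) -> str:
    """Replace emails and phone-like numbers before the text hits logs."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text
```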
Interview Insight: "What could go wrong?" is a question you should ask yourself before they do. Production systems fail, drift, and cost more than expected. Showing you've thought about that is half the battle.
Why This Matters in Production: We've seen extraction accuracy drop after an edge case we didn't anticipate (multi-booking emails, contradictory dates, non-English). We've seen costs spike when model routing wasn't tuned. We've seen guardrail violations when an agent bypassed validation. These aren't hypotheticals—they're post-mortems.
Aha Moment: Production thinking isn't an add-on. It's the difference between "we built a demo" and "we built something that runs in production and we can sleep at night."
5. The 45-Minute Interview Structure
Use this as your mental checklist. Don't rigidly time yourself, but aim for the rhythm.
| Phase | Time | Focus |
|---|---|---|
| Requirements | 3–5 min | Clarify users, scale, quality, constraints, LLM-specific needs |
| High-level architecture | 10 min | Diagram, components, data flow, key decisions |
| Deep dive | 15–20 min | Go deep on RAG, agents, or platform—justify choices |
| Production | 10 min | Monitoring, scaling, failure, cost, security |
| Discussion | 5 min | Trade-offs, future improvements, questions for them |
```mermaid
flowchart LR
    req[Requirements] --> arch[Architecture]
    arch --> deepDive[Deep Dive]
    deepDive --> prod[Production]
    prod --> discuss[Discussion]
```

Interview Insight: Leave room for discussion. "If latency were more critical, I'd consider..." or "What are the biggest pain points you've seen in production?" shows you're thinking beyond the whiteboard.
Why This Matters in Production: Real interviews don't stick to the script. The discussion phase is where you show you can think on your feet and engage with their specific concerns.
Aha Moment: Running out of time before production concerns is the most common mistake. If you're 25 minutes in and still polishing the diagram, you've lost. Move on.
6. Practice Design Problems
Design Problem 1: Email Extraction & Booking (Maersk-flavored)
Context: Unstructured customer emails → structured bookings. High-confidence auto-create; low-confidence human review. This is exactly what we built.
Architecture: Email ingestion → LLM extraction agent → RAG enrichment (port codes, commodity codes, customer info) → confidence scoring → route to auto-create or human review → Booking API.
Key decisions:
- Structured output schema (JSON) for extraction
- RAG enriches and validates to reduce hallucinations (e.g., invalid port codes)
- Per-field confidence, aggregate threshold (e.g., >0.9 → auto)
- Human feedback loop: corrections improve future extractions
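The per-field confidence with an aggregate threshold can be sketched like this; the minimum-over-fields aggregation rule is an assumption for illustration, since any single weak field is enough to warrant review:

```python
AUTO_THRESHOLD = 0.9  # matches the >0.9 auto-create rule above

def route_extraction(field_confidences: dict[str, float]) -> str:
    """Auto-create only if every extracted field clears the threshold;
    one weak field sends the booking to human review."""
    aggregate = min(field_confidences.values())
    return "auto_create" if aggregate > AUTO_THRESHOLD else "human_review"
```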
```mermaid
flowchart TB
    subgraph ingestion
        emailApi[Email API] --> parser[Email Parser]
        parser --> attachments[Attachment Processor]
    end
    subgraph extraction
        attachments --> llmExtract[LLM Extraction Agent]
        llmExtract --> ragEnrich[RAG Enrichment]
        ragEnrich --> confidence[Confidence Scorer]
    end
    subgraph routing
        confidence --> decision{Confidence > 0.9?}
        decision -->|Yes| autoCreate[Auto-create Booking]
        decision -->|No| reviewQueue[Human Review Queue]
        reviewQueue --> humanApproves[Human Approves]
        humanApproves --> autoCreate
    end
    autoCreate --> bookingApi[Booking API]
```

Design Problem 2: Multi-Agent Customer Support
Context: Chat-based support with FAQ, order status, complaints, technical issues. Escalate unresolved to humans. 70%+ resolution without handoff.
Architecture: User message → router agent (intent) → specialist agents (FAQ/RAG, order tools, complaint empathy, technical troubleshooting) → unresolved? → human handoff.
Key decisions:
- Router: GPT-4o-mini, fast. Fallback to complaint/technical if ambiguous (safer to escalate)
- Tools: Single-responsibility. Order agent can't promise refunds. Complaint agent has escalation tools only.
- Guardrails: No unauthorized promises. PII out of logs.
```mermaid
flowchart TB
    userMsg[User Message] --> router[Router Agent]
    router --> intent{Intent}
    intent -->|FAQ| faqAgent[FAQ Agent + RAG]
    intent -->|Order| orderAgent[Order Agent + Tools]
    intent -->|Complaint| complaintAgent[Complaint Agent]
    intent -->|Technical| techAgent[Technical Agent]
    faqAgent --> resolved{Resolved?}
    orderAgent --> resolved
    complaintAgent --> resolved
    techAgent --> resolved
    resolved -->|Yes| response[Response to User]
    resolved -->|No| handoffQueue[Human Handoff Queue]
```

Design Problem 3: Enterprise AI Agent Platform (Maersk-flavored)
Context: Centralized platform for multiple teams. Model access, guardrails, evaluations, observability, prompt management, tools. This is the platform we built.
Architecture: API gateway → agent orchestration → model router → LLM providers. Guardrail engine, tool registry, evaluation service, observability (MLflow, tracing). Prompt registry, dataset management.
Key decisions:
- Multi-tenancy: Tenant ID everywhere. Row-level security. Per-tenant quotas.
- Guardrail engine: No agent bypasses. Per-agent config for sensitivity.
- Self-service: Teams define agents via config, submit for approval. Platform team reviews guardrails, tools, cost budget.
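Per-tenant quota enforcement can be sketched as a simple lookup gate; the quota numbers and tenant names are placeholders:

```python
# Illustrative daily token quotas per tenant (placeholder values).
TENANT_QUOTAS = {"logistics": 1_000_000, "finance": 250_000}

def within_quota(tenant_id: str, used_today: int, request_tokens: int) -> bool:
    """Deny unknown tenants; otherwise enforce the daily token cap."""
    quota = TENANT_QUOTAS.get(tenant_id, 0)
    return used_today + request_tokens <= quota
```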
```mermaid
flowchart TB
    subgraph gateway
        apiGw[API Gateway] --> auth[Auth / Rate Limit]
    end
    subgraph platform
        auth --> orchestration[Agent Orchestration]
        orchestration --> modelRouter[Model Router]
        modelRouter --> llmProviders[LLM Providers]
        orchestration --> guardrails[Guardrail Engine]
        orchestration --> toolRegistry[Tool Registry]
        orchestration --> evals[Evaluation Service]
    end
    subgraph observability
        mlflow[MLflow]
        tracing[Tracing]
        dashboards[Dashboards]
    end
    orchestration --> mlflow
    orchestration --> tracing
    orchestration --> dashboards
```

Design Problem 4: RAG for 10M Legal Documents
Context: Lawyers search 10M docs. Citations must be correct—wrong citations are unacceptable. Moderate query volume. Documents change infrequently (weekly batch).
Architecture: Document ingestion → chunking → embedding → vector store. Query → embed → hybrid search (BM25 + vector) → rerank → generation with citations → grounding verification.
Key decisions:
- Chunking: Hierarchical (by section) + semantic. Tables as single chunks—fee schedules, etc. Citation metadata on every chunk.
- Hybrid search: BM25 for exact terms (case names, statute numbers); vector for semantic ("breach of contract" ≈ "contract violation"). RRF fusion.
- Reranking: Cross-encoder on top-50 → top-10. Legal needs high precision.
- Grounding verification: Every citation in output must appear in retrieved chunks. Answer must not add info outside context. Fail → flag or "I couldn't find sufficient support."
- Accuracy over latency: 3–5 second latency is acceptable for high-stakes legal queries.
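The RRF fusion step named above can be sketched in a few lines; k=60 is the conventional constant:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: combine BM25 and vector rankings.
    Each ranking is a list of doc IDs, best first. A document scores
    1/(k + rank) per list it appears in; scores are summed across lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF uses only ranks, it sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales.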
```mermaid
flowchart TB
    subgraph ingestion
        docs[Documents] --> chunking[Chunking]
        chunking --> embedding[Embedding]
        embedding --> vectorStore[Vector Store]
    end
    subgraph query
        userQuery[User Query] --> queryEmbed[Embed Query]
        queryEmbed --> hybridSearch[Hybrid Search BM25 + Vector]
        hybridSearch --> rerank[Reranker]
        rerank --> generation[Generation + Citations]
        generation --> grounding[Grounding Verification]
    end
```

7. Key Dimensions Checklist
| Dimension | Key Questions |
|---|---|
| Model selection | Which model(s)? Routing by complexity/cost? Fallbacks? |
| Data pipeline | Ingestion, chunking, embedding, indexing. Update frequency? |
| Retrieval | Chunking, embedding, vector store, hybrid, reranking |
| Agents | Orchestration, tools, state, error handling, handoff |
| Evaluation | Golden datasets, offline/online evals, A/B, human review |
| Guardrails | Input/output validation, PII, content policy |
| Observability | Metrics, tracing, dashboards, alerts, cost |
| Scaling/cost | Horizontal scale, caching, model routing, token budgets |
| Error handling | Retries, fallbacks, circuit breakers, degradation |
| Security | Auth, access control, audit logging, compliance |
8. Behavioral: The STAR Method
Situation: Context. Where, who, state of things. (2–3 sentences)
Task: Your responsibility. Goal, constraints. (1–2 sentences)
Action: What you did. Decisions, steps, challenges. Longest part. (3–5 sentences)
Result: Outcome. Quantify. What you learned. (2–3 sentences)
Interview Insight: "I would..." is weak. "I did..." is strong. Specific examples, not hypotheticals. For senior roles, emphasize influence, mentoring, cross-team impact.
9. Your 8 STAR Stories (Maersk-Grounded)
Story 1: Most Complex AI System You Designed
Situation: At Maersk, different teams—logistics, customer service, finance—were building AI agents in silos. Separate LLM access, ad-hoc guardrails, no cost visibility, duplicated eval and observability work. No central place to deploy or govern. Classic "every team reinvents the wheel" problem.
Task: Design and architect an enterprise AI agent platform from scratch. Centralized model access, guardrails, evals, observability, prompt management, tools. Self-service deployment for teams—so they could move fast without us becoming a bottleneck.
Action: Led architecture design with stakeholders from each team. Decisions: (1) Centralized model router with per-tenant rate limits and cost tracking—OpenAI, Azure, Anthropic based on compliance. (2) Guardrail engine—input/output validation, no bypass. (3) Tool registry (MCP, custom), sandboxed, audited. (4) Evaluation service—golden datasets, offline/online evals. (5) Observability—MLflow, tracing. Modular design for independent scaling. Tenant IDs everywhere, row-level security. Aligned with security/compliance early. Piloted with two teams before rollout.
Result: Twelve production agents across five teams. 30% less redundant API spend. Zero guardrail violations for new deployments. Two teams migrated in six weeks. The platform became the default path for new AI initiatives.
What makes this answer strong: You led the design, not just implemented it. You name concrete components and justify them. You show cross-team influence and production outcomes. Specific numbers (12 agents, 30%, 6 weeks) beat vague claims.
Story 2: AI System Failed in Production
Situation: Email booking automation was live. We extracted cargo, origin, destination, dates and auto-created bookings for high-confidence cases. Worked great in demos. After a few weeks in production: wrong extractions, missed bookings, customer complaints. Edge cases bit us—multi-booking emails, different languages, contradictory info ("ship Monday" vs. "ETA Tuesday"), "as per my last email" with no context. The happy path was fine; the long tail was a mess.
Task: Diagnose failure patterns, fix the pipeline, prevent recurrence. Maintain customer trust, avoid incorrect bookings. No hiding under the desk—we had to own it.
Action: Led post-mortem. Categorized failures: multi-booking (merge/drop), language (lower accuracy), contradictions (arbitrary choice), reference emails (no context). Added edge cases to eval dataset, ran regression tests. Improved prompt: explicit multi-booking (JSON array), flag contradictions for human review. Language detection → multilingual model. Lowered auto-create threshold, increased human-in-the-loop. Guardrail: contradictory dates → human review. Monitoring: extraction accuracy per field, human override rate, failure by email type. Weekly eval runs. Documented everything.
Result: Extraction accuracy went from 78% to 92%. Human review 15% → 25% → 18% as the model improved. 50+ new eval cases added. No critical booking errors the next quarter. The incident became a case study for "why evals and guardrails matter."
What makes this answer strong: You owned the failure. You structured the diagnosis, applied fixes, and added guardrails. You show resilience and systematic thinking. Quantified before/after. Honest about what went wrong—and what you learned.
Story 3: Disagreement on Technical Direction
Situation: Debate over RAG pipeline: build custom vs. use a managed service. I wanted in-house control for flexibility and cost; a key stakeholder wanted a vendor for speed and "don't build what you can buy." Decision affected the 6-month roadmap and multiple teams.
Task: Align the team while respecting constraints. Find a path that didn't leave anyone feeling railroaded.
Action: Technical deep-dive. Comparison: build vs. buy (time, flexibility, cost, lock-in, maintenance). Acknowledged the speed concern—we weren't ignoring it. Proposed hybrid: managed vector store (Pinecone/Weaviate) + custom chunking and retrieval logic. 80% of vendor speed, control over the parts that mattered. Ran a 2-week PoC to prove feasibility. Documented trade-offs, set a 3-month review trigger. Presented options with pros/cons, not just "my way."
Result: Shipped in four weeks. Custom chunking improved retrieval quality by 15% over the vendor default. The stakeholder became an advocate for the hybrid approach. Decision doc is still referenced when similar debates come up.
What makes this answer strong: You didn't dig in—you found a compromise that addressed both concerns. You documented the decision for future reference. Shows you can influence without authority and build consensus.
Story 4: Biggest Learning Building AI Agents
Situation: We built a multi-agent support system with broad autonomy—each agent could call any tool, make any decision. Goal: 70% resolution without handoff. We were at 45%. Failures: wrong promises (agent promised refunds we couldn't deliver), infinite loops, wrong info when we should have escalated. More autonomy sounded great in theory; in practice, it was chaos.
Action: Realized we'd over-engineered. Stepped back. Adopted simpler approach: (1) Narrow tool scope—complaint agent can't call order API. Single responsibility. (2) Deterministic guardrails—"Never promise refund without authorization." Hard rules, not LLM judgment. (3) Step limits—cap tool calls, total turns. No runaway loops. (4) Confidence-based escalation—below threshold → human. Simplified router: fast classifier, one-shot route. More predictable, easier to debug. Added evals for each failure mode in CI. Deployed incrementally.
Result: Resolution rate hit 72%. "Wrong promise" escalations dropped to zero. Latency improved 30%. Learning: start simple, add complexity only when needed. We now evangelize this internally—"narrow tools, hard guardrails, step limits" is our default advice for new agents.
What makes this answer strong: You admit the initial approach was wrong. You describe a concrete fix and a reusable lesson. Shows intellectual honesty and the ability to course-correct. The lesson is transferable—interviewers love that.
Story 5: Build vs. Buy
Situation: Guardrail solution needed. Options: custom (NeMo, Guardrails AI), managed (Azure Content Safety, OpenAI Moderation), enterprise vendor. Three-month window to production. Everyone had an opinion; we needed a decision framework, not a gut call.
Action: Built a decision matrix: time to market, fit, cost, control, lock-in, compliance. Ran PoC for top two options. Chose hybrid: Azure Content Safety for content moderation (buy—proven, fast); custom rules for prompt injection and PII (build—our requirements are specific, GDPR-sensitive). Used Guardrails AI as reference, built own pipeline for flexibility. Documented criteria for revisiting—"if X happens, we reconsider."
Result: Shipped in 10 weeks. 40% cheaper than full vendor lock-in. Full PII control for GDPR. Decision doc is still the template for build-vs-buy debates. We've reused the hybrid pattern twice since.
What makes this answer strong: You used a structured framework, not gut feel. You explain the hybrid logic clearly—buy where it makes sense, build where it doesn't. Shows disciplined decision-making.
Story 6: Mentoring Junior Engineers
Situation: Two juniors joined—strong SWE background, new to LLMs. Building RAG Q&A, stuck on retrieval quality. Chunks were wrong, reranking wasn't helping, they didn't know how to debug. They'd been thrown at the problem without the mental model.
Action: Weekly 1:1s. Shared internal docs, papers, our eval framework. Walked through the pipeline: chunking, embedding, search, reranking. Debugged together—added logging to inspect chunks, ran eval queries, traced failures. Introduced evals: golden questions, expected answers. "When you find a bug, add it as an eval case." Code review focus: "What if retrieval returns nothing? What if the model hallucinates?" Gave ownership—one took reranking, one took chunking. Iterated with feedback. Had them present in team meetings. Didn't do it for them—taught them how.
Result: Three months: +20% retrieval accuracy, 30 new eval cases. One of them now leads RAG improvements. The same junior mentored a new hire with the same approach—knowledge transferred. They went from "stuck" to "driving improvements."
What makes this answer strong: You didn't just "help"—you transferred a methodology (evals, failure modes, ownership). You grew the team. Quantified outcomes. Shows leadership and investment in people.
Story 7: Delivering Under Tight Deadlines
Situation: Demo for a key stakeholder in six weeks. Working agent: shipping schedules, draft bookings. Prototype only—no guardrails, no evals, failed on edge cases. Team of two. Stakeholder wanted to see "AI that works." We had to show something credible without over-promising.
Action: Ruthless prioritization. Must-haves: working agent, basic guardrails, 10 good demo scenarios. Nice-to-haves: full eval suite, dashboards—explicitly out of scope. Two-week sprint plan, daily standups. Weeks 1–2: core logic, tools, curated demo queries, minimal guardrails. Documented limitations and fallback answers. Weeks 3–4: polish, more scenarios. Delegated: one on agent, one on UI. Dry run two days before. Communicated scope early—"Phase 1" with Phase 2 roadmap. No surprises.
Result: Demo succeeded. Stakeholder approved budget for Phase 2. We delivered by scoping aggressively. Full eval and observability came in Phase 2—we didn't pretend we could do it all in six weeks. The stakeholder trusted us more because we were honest about what was in and what wasn't.
What makes this answer strong: You show prioritization, delegation, and stakeholder communication. You didn't over-promise. "Phase 1 / Phase 2" framing shows you think in milestones. Honesty about scope is a strength.
Story 8: Staying Current with AI
Situation: AI moves fast. Need to stay current for architecture decisions and team guidance. Delivery commitments limit time. "I'll read when I can" doesn't work—you never do.
Action: Layered approach. Daily: curated feed (Twitter, newsletters, internal channel), 15–20 min. Weekly: 1–2 deep reads on RAG, agents, evals. Monthly: try one new tool or technique, document findings. Quarterly: conference or workshop. Learn from the team—juniors bring new ideas. Share in tech talks and docs. Tie learning to decisions: "We adopted semantic caching after reading X—saved 40% API cost."
Result: Adopted semantic caching (-40% API cost), better embedding model (+15% retrieval), avoided over-investing in trends that didn't pan out. When someone asks "should we try X?" I can give a grounded answer. The system is sustainable—not "cram before interview" but "continuous, low-friction learning."
What makes this answer strong: You have a system, not "I read when I can." You tie learning to concrete outcomes (cost, accuracy). Shows you're not coasting and that you connect theory to practice.
10. Conversational Interview Q&A (Maersk-Referenced)
Q1: "Design a system that extracts structured data from emails and creates bookings."
Weak answer: "We'd use an LLM to extract and then call an API." No RAG, no confidence, no human loop, no handling of attachments or multiple bookings. Sounds like a weekend hackathon project.
Strong answer (Maersk): "At Maersk we do exactly this. Emails come in via API or mailbox sync. We parse and process attachments (PDF, Excel). An LLM extraction agent outputs structured JSON—cargo, origin, destination, dates, quantities—and we use RAG over a knowledge base of port codes, commodity codes, and customer info to enrich and validate, which cuts hallucinations on invalid codes. We score confidence per field and aggregate—above 0.9 we auto-create; below, human review. Multi-booking emails get split into separate extractions. Corrections feed back into evals. We track cost per booking, extraction latency, and human override rate."
Q2: "How do you handle when an AI system fails in production?"
Weak answer: "We'd add more tests." Vague, reactive, no structure. Doesn't show you've actually dealt with failure.
Strong answer (Maersk): "We ran a post-mortem on the email booking system when we saw wrong extractions and complaints. We categorized failures—multi-booking merge/drop, language accuracy, contradictory dates, reference emails. We added those to the eval dataset and improved the prompt: explicit multi-booking output, flag contradictions for human review, route non-English to a multilingual model. We lowered the auto-create threshold and added a guardrail: contradictory dates always go to human review. We set up monitoring—accuracy per field, override rate, failure by email type—and weekly evals. Accuracy went from 78% to 92%. The key is: structure the diagnosis, fix systematically, and add guardrails so it doesn't repeat."
Q3: "Design an enterprise AI agent platform."
Weak answer: "We'd have an API and some LLMs and tools." No multi-tenancy, no guardrails, no evaluation, no observability. Could describe half the startups in the Valley.
Strong answer (Maersk): "We built this at Maersk. API gateway for auth and rate limiting. Agent orchestration loads config, invokes tools via a registry, calls a centralized model router that handles multiple providers with per-tenant quotas and cost tracking. A guardrail engine applies input/output validation to every request—no agent bypasses. Tool registry supports MCP and custom tools with sandboxing and audit. Evaluation service has golden datasets, offline and online evals, A/B for prompts. Observability: MLflow for experiments, tracing for production. Prompt registry with versioning. Multi-tenancy: tenant ID in every request, row-level security. Teams define agents via config, submit for approval—we review guardrails, tool access, cost budget. The platform is infrastructure; teams own agent logic and quality."
Q4: "How do you decide between RAG and fine-tuning?"
Weak answer: "RAG for dynamic data, fine-tuning for static." Oversimplified, no nuance. Interviewer will probe and you'll crumble.
Strong answer (Maersk): "RAG when the knowledge changes often or we need citations—like our port/commodity knowledge base that updates regularly. Fine-tuning when the task is stable and we need consistent behavior or lower latency—e.g., domain-specific extraction schema. We've used both: RAG for enrichment in email booking, fine-tuning for some classification tasks. Hybrid is common: fine-tuned embedding model, RAG retrieval, fine-tuned small model for routing. The decision depends on update frequency, citation needs, and whether we have enough labeled data for fine-tuning."
Q5: "What's your approach to guardrails?"
Weak answer: "We filter inputs and outputs." No detail on what or how. Could mean anything.
Strong answer (Maersk): "Centralized guardrail engine—input and output. Input: prompt injection detection, PII masking, length limits, content boundaries. Output: content policy (toxicity, off-brand), format validation, citation verification for RAG, PII leak check. Configurable per agent—e.g., finance gets stricter PII rules. Per-tenant overrides. All violations logged. At Maersk, no agent can bypass; the orchestration layer invokes the engine on every request. Failures block or route to human. We use Azure Content Safety for moderation and custom rules for prompt injection and PII because our requirements are specific."
Q6: "How do you evaluate AI systems before and after deployment?"
Weak answer: "We test them." No structure, no metrics. Dead end.
Strong answer (Maersk): "Golden datasets per agent with expected inputs/outputs. Offline evals before deploy—accuracy, latency, cost. We run them in CI so regressions block release. Online evals: sample production traffic, same metrics. A/B tests for prompt changes. Human review samples for high-stakes domains. At Maersk we track extraction accuracy per field, human override rate, resolution rate for support agents. We added 50+ eval cases after our email booking failures. Weekly eval runs. Key: evals must cover edge cases, not just happy path."
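The "offline evals in CI so regressions block release" point can be sketched in a few lines: run the golden dataset through the extractor and fail the build if accuracy drops below a threshold. The golden cases, the stand-in extractor, and the 90% threshold are invented for illustration.

```python
# Hypothetical golden dataset: input text plus expected extraction
GOLDEN = [
    {"input": "Book 2x40ft reefer, Rotterdam to Santos", "expected_port": "Santos"},
    {"input": "One dry container to Shanghai", "expected_port": "Shanghai"},
]

def extract_port(text: str) -> str:
    # Stand-in for the real extraction model
    return text.rsplit(" to ", 1)[-1] if " to " in text else text.split()[-1]

def run_eval(threshold: float = 0.9) -> float:
    correct = sum(extract_port(c["input"]) == c["expected_port"] for c in GOLDEN)
    accuracy = correct / len(GOLDEN)
    # Raising here is what makes a regression block the release in CI
    assert accuracy >= threshold, f"eval regression: {accuracy:.0%} < {threshold:.0%}"
    return accuracy
```

The same harness can track per-field accuracy, latency, and cost; the key interview point is that it runs automatically on every change, not ad hoc.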
Q7: "How do you balance speed and quality when shipping AI features?"
Weak answer: "We try to do both." No trade-off reasoning. Doesn't answer the question.
Strong answer (Maersk): "We scope aggressively for demos and Phase 1. Must-haves: working feature, minimal guardrails, known-good scenarios. Nice-to-haves: full eval suite, dashboards. We document limitations and set expectations. For production, we don't skip evals or observability—we've seen what happens. The balance: ship something useful quickly, but don't ship something that will silently fail or burn budget. At Maersk we've done 6-week demos that led to Phase 2 with full infrastructure. The key is being explicit about what's in and what's out, and having a plan to close the gap."
11. From Your Experience (Maersk Prompts)
Use these to practice connecting your experience to common questions:
- "Tell me about a complex system you designed." → Enterprise AI Agent Platform: centralized model router, guardrail engine, tool registry, evaluation service, observability. Multi-tenancy, self-service deployment, piloted with two teams. Twelve production agents, 30% cost reduction, zero guardrail violations.
- "Tell me about a time an AI system failed." → Email booking automation: wrong extractions, missed bookings, edge cases (multi-booking, language, contradictions). Post-mortem, eval expansion, prompt improvements, guardrails, monitoring. 78% → 92% accuracy, 50+ new eval cases.
- "How do you work with stakeholders who disagree?" → Build vs. buy on RAG: hybrid approach (managed vector store + custom retrieval), PoC to prove feasibility, documented trade-offs, 3-month review. Stakeholder became advocate. Shipped in four weeks, 15% retrieval improvement.
12. Quick Fire Round
System design
- What's the difference between RAG and fine-tuning? RAG retrieves at query time; fine-tuning bakes knowledge into weights. RAG for dynamic/citable; fine-tuning for stable behavior.
- When do you use an agent vs. a single LLM call? Agent when you need tools, multi-step reasoning, or external APIs. Single call for stateless Q&A or classification.
- What's hybrid search? BM25 (keyword) + vector (semantic). Fusion (e.g., RRF) for better recall.
- Why rerank after retrieval? Cross-encoders improve precision. Retrieve more (e.g., 50), rerank to top 10.
- What's a circuit breaker? Stop calling a failing service after N failures. Prevents cascade, enables fallback.
- How do you reduce LLM cost? Model routing (cheap for simple), caching (semantic + prompt), token budgets, smaller models where possible.
- What guardrails do you need? Input: injection, PII, length. Output: content policy, format, citation verification.
- How do you evaluate RAG? Faithfulness, relevance, citation accuracy. Golden Q&A, human eval samples.
- When should contradictory inputs go to human review? Always—never auto-resolve when the model can't reconcile conflicting information. Maersk: contradictory dates → human.
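The hybrid-search answer above mentions fusion via RRF. A minimal sketch of reciprocal rank fusion, combining a BM25 ranking with a vector ranking; the k=60 constant is the commonly used default, and the document IDs are made up:

```python
def rrf(rankings: list, k: int = 60) -> list:
    """Fuse ranked lists: each doc scores sum of 1/(k + rank) across lists."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d1", "d2", "d3"]     # keyword ranking
vector_hits = ["d3", "d1", "d4"]   # semantic ranking
fused = rrf([bm25_hits, vector_hits])
```

Documents ranked highly by both retrievers float to the top; a cross-encoder reranker would then trim the fused list to the final top-k.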
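The circuit-breaker item above is also worth being able to sketch on a whiteboard: after N consecutive failures, stop calling the downstream service for a cooldown window and serve a fallback instead. Thresholds and the fallback here are illustrative assumptions.

```python
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 3, cooldown_s: float = 30.0):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None  # set when the breaker trips

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                return fallback()      # circuit open: fail fast, no downstream call
            self.opened_at = None      # cooldown over: half-open, try again
            self.failures = 0
        try:
            result = fn()
            self.failures = 0          # success resets the failure count
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback()
```

For an LLM platform the fallback might be a cached response, a cheaper model, or a "try again later" message; the point is the cascade stops at the breaker.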
Behavioral
- STAR stands for? Situation, Task, Action, Result.
- What makes a strong behavioral answer? Specific, quantified, shows thinking, honest about failure, emphasizes influence.
- How many stories should you prepare? Eight that flex across questions. Map to categories.
- "Tell me about a failure" — what to avoid? Blaming others, being passive, hiding the lesson.
- How do you show leadership without authority? Influence through docs, PoCs, alignment meetings. Build consensus.
- What's the "I would" trap? Hypotheticals are weak. Use "I did" with real examples.
- How long should a STAR answer be? 2–3 minutes. They'll ask for more if needed.
13. Key Takeaways
| Area | Takeaway |
|---|---|
| Requirements | Clarify before designing. Ask about users, scale, quality, constraints, LLM-specific needs. |
| Architecture | Diagram + data flow + justified decisions. "I chose X because Y." |
| Deep dive | Go deep on 1–2 subsystems. Discuss trade-offs, not just components. |
| Production | Monitoring, scaling, failure, cost, security. Think beyond happy path. |
| Interview structure | Requirements → architecture → deep dive → production → discussion. Don't run out of time. |
| STAR | Situation, Task, Action, Result. Specific, quantified, real examples. |
| Stories | 8 stories. Map to question types. Emphasize influence, mentoring, recovery. |
| Weak vs. strong | Weak: vague, hypothetical, no structure. Strong: specific, Maersk-anchored, systematic. |
| Maersk edge | Email booking, Enterprise AI Agent Platform, human-in-the-loop, guardrails, evals—you have the stories. Use them. |
14. Further Reading
- Anthropic, "Building Effective Agents" (Dec 2024): Patterns for agent design, workflows vs. agents, when to add complexity. Practical and opinionated.
- System Design Handbook, "Generative AI System Design Interview": Step-by-step framework, RAG deep dive, model routing, cost/safety/observability. Comprehensive.
- Google Cloud, "Choose a design pattern for your agentic AI system": ReAct, Reflexion, Tree-of-Thought, routing patterns. Good for agent architecture.
- OWASP Top 10 for LLM Applications (2025): Prompt injection, insecure output, training data poisoning, excessive agency. Security-focused.
- Tech Interview Handbook, "Behavioral interviews for senior candidates": Leadership spectrum, common pitfalls, story selection. Behavioral-specific.
End of Session 16. This completes the Senior AI Engineer interview preparation guide. You've got the architecture and the stories. Now go show them you can build things—and lead.