Session 19: Technical Leadership for Lead AI Engineer Roles
The jump from Senior to Lead isn't about writing better code—it's about making everyone else's code better. You stop being measured by your individual output and start being measured by the team's output. The questions change: not "did you ship it?" but "did the team ship it? Did they learn? Did they avoid the pitfalls you've already stepped in?" And in AI, that's harder than in most domains. Frameworks rotate monthly. Models improve quarterly. Last year's best practice is this year's anti-pattern. Your job is to create clarity in chaos.
This session is for you. You've built the Enterprise AI Agent Platform at Maersk. You've shipped email booking automation. You've mentored backend engineers into AI development. You've run workshops and created internal content as an AI Advocate. Now you need to package that into the language of Lead—technical direction, design reviews, platform thinking, mentoring, cross-functional collaboration, and the behavioral stories that prove you can run the show. Let's go.
1. Setting Technical Direction for AI Teams
Interview Insight: "How do you set technical direction?" is really "Can you balance innovation and delivery? Can you say no?" They want a decision framework, not a list of tools you like.
How to define an AI strategy that balances innovation and delivery. Choosing frameworks, models, and patterns for the team. Creating architecture decision records (ADRs). Balancing "build the best" vs "ship fast." How to say no to shiny new tools.
The restaurant analogy: Imagine you're the head chef. You want to try every new technique—molecular gastronomy, fermentation, sous-vide everything. But you have a menu, customers waiting, and a kitchen that needs to execute. Technical direction is deciding: this we standardize (everyone uses the same stove), that we experiment with (one station gets the new gear), and that we explicitly defer (maybe next year). You're not banning creativity—you're creating guardrails so the team doesn't chase every trend and ship nothing.
Practical tactics:
- Define a stack early. At Maersk, we standardize on LangChain/LangGraph for agent orchestration, Azure OpenAI for models, MLflow for observability, DeepEval/RAGAS for evaluations. New teams don't get to pick "whatever framework sounds cool"—they inherit the platform. That's not tyranny; it's leverage. Shared stack = shared tooling, shared docs, shared debugging.
- ADR everything. When we chose LangGraph over raw LangChain for agent flows, we wrote an ADR. Why? Maintainability, state management, checkpointing for human-in-the-loop. When we rejected Framework X for "it's newer": ADR. When we adopted MCP for tool integration: ADR. ADRs force you to articulate why, and future-you (or a new hire) will thank present-you.
- "Ship fast" vs "build the best" matrix. Plot features: high impact + low complexity = ship fast. High impact + high complexity = build well, but scope ruthlessly. Low impact = don't build unless it's quick. The email booking system was high impact—we shipped a Phase 1 with human-in-the-loop in 6 weeks. The AI platform was high complexity—we took longer but built guardrails and evals from day one because we knew teams would depend on it.
- Saying no to shiny tools. Rule of thumb: if it doesn't solve a problem we have today, we don't adopt it. "But Framework Y just released!"—does it give us something our stack can't? If yes, PoC it. If no, we document it as "evaluated, not needed" and move on. I've seen teams burn months chasing the latest agent framework. We picked LangGraph when it was stable; we'll revisit when we hit its limits.
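To make the impact/complexity matrix concrete, here is a minimal Python sketch; the function name and bucket labels are illustrative, not a real tool:

```python
def triage(impact: str, complexity: str) -> str:
    """Map a feature's impact/complexity ('high'/'low') to a delivery strategy."""
    if impact == "high":
        # High impact: always build it; complexity decides how carefully.
        return "ship fast" if complexity == "low" else "build well, scope ruthlessly"
    # Low impact: only worth doing if it's quick.
    return "quick win" if complexity == "low" else "don't build"
```

By this rubric, the email booking system (high impact) was a "ship fast" candidate, while the AI platform (high complexity) fell into "build well, scope ruthlessly."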
```mermaid
flowchart TB
    subgraph Input["New Tool/Framework Request"]
        req[Request]
    end
    req --> Q1{Does it solve a problem we have today?}
    Q1 -->|No| Reject[Document & Defer]
    Q1 -->|Yes| Q2{Does our stack already cover it?}
    Q2 -->|Yes| Reject
    Q2 -->|No| PoC[PoC with clear success criteria]
    PoC --> Eval{Meets criteria?}
    Eval -->|Yes| ADR[Write ADR, adopt]
    Eval -->|No| Reject
```

Why This Matters: Without direction, every team invents their own LLM gateway, their own guardrails, their own eval pipeline. You get 12 agents with 12 different failure modes and zero shared learnings. Technical direction isn't about control—it's about compound returns.
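The same adoption gate can be written as a function; this is a hypothetical sketch mirroring the flowchart, with argument names and return strings invented for illustration:

```python
from typing import Optional

def evaluate_tool_request(solves_problem_today: bool,
                          stack_already_covers: bool,
                          poc_met_criteria: Optional[bool] = None) -> str:
    """Decide what to do with a new tool/framework request."""
    if not solves_problem_today or stack_already_covers:
        return "document & defer"
    if poc_met_criteria is None:
        # A genuine gap: run a time-boxed PoC before committing.
        return "PoC with clear success criteria"
    return "write ADR, adopt" if poc_met_criteria else "document & defer"
```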
Aha Moment: The best technical direction feels boring. "We use LangChain" isn't sexy. "We evaluated 8 frameworks and chose LangChain because X, Y, Z" is leadership.
2. Running Design Reviews for AI Systems
Interview Insight: Design reviews test whether you can evaluate AI systems before they fail. Interviewers want to know: what do you look for? How do you run them without being a gatekeeper?
What to evaluate: model selection, data pipeline, evaluation strategy, cost projections, failure modes, security. How to run reviews constructively (not gatekeeping). Design review templates for AI systems. Common red flags: no eval plan, no cost estimate, no fallback strategy.
The flight checklist analogy: Pilots don't skip the pre-flight checklist because they've flown before. Design reviews are your pre-flight for AI systems. You're not trying to catch people doing it wrong—you're making sure no one forgets the obvious. "Did we check the fuel?" = "Did we estimate cost?" "Did we plan for engine failure?" = "What's the fallback when the model hallucinates?"
What to evaluate:
| Dimension | Questions to Ask | Red Flags |
|---|---|---|
| Model selection | Why this model? Cost vs accuracy? Fallback? | "We'll use GPT-4 for everything." No tiering, no cost analysis. |
| Data pipeline | Where does training/retrieval data come from? Freshness? Quality? | No data lineage, no eval on edge cases. |
| Evaluation | How will you know it works? Golden dataset? Offline evals? | No eval plan. "We'll test manually." |
| Cost | Tokens per request? Monthly projection? Budget? | No cost estimate. "We'll optimize later." |
| Failure modes | What if model is down? Hallucination spike? Prompt injection? | No fallback, no guardrails, no circuit breaker. |
| Security | PII handling? Guardrails? Audit trail? | No mention of PII, content policy, or compliance. |
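For the cost row, a back-of-envelope estimator is usually enough to surface problems early; this is a generic sketch with made-up prices, not any provider's actual rates:

```python
def monthly_cost(requests_per_day: int,
                 input_tokens: int, output_tokens: int,
                 price_in_per_1k: float, price_out_per_1k: float,
                 days: int = 30) -> float:
    """Rough monthly LLM spend in dollars from per-request token counts."""
    per_request = ((input_tokens / 1000) * price_in_per_1k
                   + (output_tokens / 1000) * price_out_per_1k)
    return round(per_request * requests_per_day * days, 2)
```

For example, 1,000 requests/day at 2,000 input and 500 output tokens, with illustrative rates of $0.01/$0.03 per 1K tokens, works out to $1,050/month. A design doc should contain this number, whatever the real rates are.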
Running reviews constructively:
- Frame it as "let's make this bulletproof." You're on their side. "What happens when X fails?" isn't criticism—it's collaboration. At Maersk, we run design reviews for every new agent on the platform. The goal: catch issues before they hit production.
- Use a template. Consistent structure = nothing slips. Ours: problem statement, architecture diagram, model choice + rationale, data/eval plan, cost estimate, failure modes + mitigations, security considerations, timeline. If a section is empty, we pause.
- Assign action items, not vetoes. "Add an eval plan by Friday" beats "this is wrong, redo it." Gatekeeping is "you can't ship." Constructive is "you can ship once these are done."
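The "if a section is empty, we pause" rule is easy to automate; a minimal sketch, with section names taken from the template above and the function name invented for illustration:

```python
REQUIRED_SECTIONS = [
    "problem statement", "architecture diagram", "model choice + rationale",
    "data/eval plan", "cost estimate", "failure modes + mitigations",
    "security considerations", "timeline",
]

def missing_sections(design_doc: dict) -> list:
    """Return template sections that are absent or blank; any hit pauses the review."""
    return [s for s in REQUIRED_SECTIONS
            if not str(design_doc.get(s, "")).strip()]
```

A check like this can run in CI on the design doc repo, so reviews start from a complete submission.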
```mermaid
flowchart LR
    subgraph Review["Design Review Flow"]
        Submit[Team Submits Design] --> Check[Reviewer Checklist]
        Check --> Discuss[Discussion]
        Discuss --> Actions{Action Items?}
        Actions -->|Yes| Fix[Team Addresses]
        Fix --> Recheck[Re-review]
        Recheck --> Approve[Approve]
        Actions -->|No| Approve
        Approve --> Ship[Ship]
    end
```

Why This Matters: The email booking system had a post-mortem because we missed edge cases in design—multi-booking, contradictory dates, non-English. A design review with "what are the failure modes?" would have caught some of that up front.
Aha Moment: The red flags are almost always the same: no eval plan, no cost estimate, no fallback. If you fix those three, you've eliminated 80% of production surprises.
3. Building Reusable Components and Platform Patterns
Interview Insight: Platform work is leadership. "Tell me about building reusable components" maps directly to the Maersk AI Platform. They want to hear: when to build shared vs DIY, what you built, and how it helped others.
When to build shared tooling vs let teams DIY. Internal SDKs, shared prompt libraries, evaluation frameworks. The platform mindset: reduce friction for other teams. This maps directly to Bhanu's Maersk AI Platform work.
The power grid analogy: You don't give every house a diesel generator. You build a grid. Some houses add solar (team-specific logic)—fine. But everyone plugs into the same infrastructure. The AI platform is your grid: centralized LLM access, guardrails, tools, evals, observability. Teams build agents that plug in—they don't rewire the whole system.
When to build shared vs DIY:
- Build shared when: (1) Multiple teams need it. (2) Failure is costly (guardrails, security). (3) Consistency matters (evaluations, cost tracking). (4) The problem is hard and you don't want everyone to solve it (model routing, prompt management).
- Let teams DIY when: (1) One-off or experimental. (2) Domain-specific (their RAG strategy, their tools). (3) Too early to standardize—let them explore, then extract patterns.
What we built at Maersk:
- Centralized model router: One gateway. Per-tenant quotas, cost tracking, fallbacks. Teams don't manage API keys or retry logic.
- Guardrail engine: Input/output validation, PII masking, content policy. No agent bypasses—enforced at orchestration.
- Tool registry: MCP and custom tools, sandboxing, audit. Teams register; platform enforces.
- Evaluation service: Golden datasets, offline + online evals, A/B for prompts. Built once, used by all agents.
- MLflow observability: Tracing, cost per request, experiment tracking. Every agent gets it for free.
- Prompt management: Versioning, templating, A/B. Shared library of proven patterns.
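A stripped-down sketch of the router's two core responsibilities, per-tenant quotas and an ordered fallback chain, is below. This is purely illustrative, not the actual platform code; the class name, tuple return, and `call_fn` signature are all assumptions:

```python
class ModelRouter:
    """Toy gateway: enforce per-tenant quotas, fall through an ordered model chain."""

    def __init__(self, quotas, fallback_chain, call_fn):
        self.quotas = dict(quotas)          # tenant -> remaining requests
        self.chain = list(fallback_chain)   # e.g. ["primary", "secondary"]
        self.call_fn = call_fn              # call_fn(model, prompt) -> str, or raises

    def route(self, tenant, prompt):
        if self.quotas.get(tenant, 0) <= 0:
            raise RuntimeError(f"quota exhausted for {tenant}")
        self.quotas[tenant] -= 1
        last_err = None
        for model in self.chain:
            try:
                return model, self.call_fn(model, prompt)
            except Exception as err:        # this model failed; try the next one
                last_err = err
        raise RuntimeError("all models failed") from last_err
```

The point of centralizing this: teams never touch API keys or retry logic, and fixing a fallback bug here fixes it for every agent at once.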
```mermaid
flowchart TB
    subgraph Platform["Maersk AI Platform"]
        Router[Model Router]
        Guardrails[Guardrail Engine]
        Tools[Tool Registry]
        Evals[Evaluation Service]
        MLflow[MLflow Observability]
        Prompts[Prompt Management]
    end
    subgraph Agents["Team Agents"]
        A1[Email Booking Agent]
        A2[Support Agent]
        A3[Document Agent]
    end
    A1 --> Router
    A1 --> Guardrails
    A1 --> Tools
    A2 --> Router
    A2 --> Guardrails
    A3 --> Router
    Router --> MLflow
    Evals --> A1
    Evals --> A2
    Evals --> A3
```

Why This Matters: Before the platform, every team had their own LLM setup, ad-hoc guardrails, and no shared evals. We had 12 ways to fail. Now we have one way to succeed—and when we fix a bug, everyone benefits.
Aha Moment: The platform mindset is "how do I make the next team's job easier?" Not "how do I build something cool?" If you're optimizing for others' productivity, you're doing it right.
4. Mentoring Engineers New to LLMs and AI
Interview Insight: "How do you mentor engineers new to AI?" They want a structured approach—not "I pair with them sometimes." Learning paths, documentation, workshops. Your AI Advocate work is gold here.
Onboarding backend/ML engineers into AI agent development. Structured learning paths. Pair programming on AI tasks. Creating internal documentation and "AI Advocate" content. Running workshops and knowledge-sharing sessions.
The driving instructor analogy: You don't hand someone the keys and say "good luck." You start in a parking lot—simple prompts, single-call flows. Then quiet streets—RAG, tool calling. Then highway—agents, multi-step, production. Each stage has clear goals and checkpoints. Mentoring is scaffolding, not dumping.
Practical tactics:
- Structured learning path: Week 1–2: LLM basics, prompt engineering, structured output. Week 3–4: RAG, retrieval, chunking. Week 5–6: Agents, tools, LangGraph. Week 7–8: Production—evals, guardrails, observability. Capstone: build a small agent end-to-end.
- Pair on real tasks: Don't give them toy exercises. Pair on a real feature—email extraction, a new tool, an eval. They learn by doing; you catch mistakes early.
- Internal docs and "AI Advocate" content: At Maersk, we created docs on prompt patterns, RAG best practices, eval setup, common pitfalls. AI Advocate content—blog posts, demos, "how we built X"—answers the same questions before they're asked. Document once, scale mentoring.
- Workshops: Monthly "AI Office Hours." Quarterly deep-dives: "RAG from scratch," "Building your first agent." Hands-on, not slides. People remember what they build.
```mermaid
flowchart TB
    subgraph Path["Onboarding Path"]
        L1[LLM + Prompting]
        L2[RAG]
        L3[Agents + Tools]
        L4[Production]
        L1 --> L2 --> L3 --> L4
    end
    subgraph Support["Support Channels"]
        Docs[Internal Docs]
        Pair[Pair Programming]
        Workshops[Workshops]
        OfficeHrs[Office Hours]
    end
    L1 -.-> Docs
    L2 -.-> Pair
    L3 -.-> Workshops
    L4 -.-> OfficeHrs
```

Why This Matters: We onboarded backend engineers who'd never touched LLMs. In 2–3 months they were shipping agents. That's not magic—it's structure.
Aha Moment: Mentoring scales through documentation. The best mentors write things down. When someone asks "how do I do X?", the answer is "read this doc." You've multiplied yourself.
5. Cross-Functional Collaboration
Interview Insight: Lead roles require collaboration beyond engineering. Product, data science, security, infrastructure. They want to hear how you translate AI capabilities into features, manage expectations, and work with non-technical stakeholders.
Working with product (translating AI capabilities into features), data science (model evaluation, experimentation), security (guardrails, compliance), infrastructure (GPU/cloud resources). Managing expectations: what AI can and can't do.
The translator analogy: Product speaks "features." Data science speaks "metrics." Security speaks "risk." You speak "models, tokens, evals." Your job is to translate. "Can we add a summarization feature?" → "Yes, but we need to define accuracy, latency, and cost. Here's a 2-week spike to scope it."
By function:
| Partner | What they want | What you need from them | Tension to manage |
|---|---|---|---|
| Product | Features, roadmap, user value | Clear requirements, prioritization, acceptance criteria | "AI can do anything" vs reality. Set expectations: accuracy, latency, edge cases. |
| Data Science | Model quality, experimentation | Labeled data, eval datasets, feedback loops | "Perfect model" vs "good enough to ship." Agree on metrics, run A/B. |
| Security | Compliance, PII, audit | Requirements early, review cycles | "Just ship it" vs "we need X before production." Involve them in design. |
| Infrastructure | Stability, cost, scaling | GPU quotas, rate limits, SLAs | "We need 10x capacity" vs budget. Plan capacity ahead. |
Managing "what AI can and can't do": Be explicit. "The model will extract 95% of fields correctly in English. Non-English and complex formats need human review." Document limitations. When product asks for 100% accuracy with zero human review, explain why that's not feasible—and what is (e.g., 95% auto, 5% human, continuous improvement).
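The "95% auto, 5% human" framing is simple arithmetic, and it's worth showing stakeholders explicitly; a small hypothetical helper (names are illustrative):

```python
def review_load(monthly_volume: int, auto_rate: float) -> dict:
    """Translate an automation rate into concrete monthly workloads."""
    auto = round(monthly_volume * auto_rate)
    return {"auto_processed": auto, "human_review": monthly_volume - auto}
```

At 1,200 bookings/month and a 95% auto rate, that's 1,140 automated and 60 for human review: a number a product manager can staff against, unlike "95% accuracy."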
Why This Matters: The email booking system shipped because we aligned with product on Phase 1 scope (English, common formats, human-in-the-loop) and deferred "fully automated everything" to Phase 2. Without that, we'd still be in scope creep.
Aha Moment: "Underpromise, overdeliver" is cliché but true for AI. Set expectations at 90%; if you hit 95%, you're a hero. Promise 100% and miss—you've lost trust.
6. Hiring and Evaluating AI Engineering Candidates
Interview Insight: "What do you look for in AI engineers?" Production experience > paper knowledge. System design, debugging, cost awareness. They want a hiring philosophy, not a generic checklist.
What to look for: production experience > paper knowledge, system design ability, debugging skills, cost awareness. Interview design for AI roles. Red flags: can't explain trade-offs, only knows one framework, no production experience.
The ship captain analogy: You don't hire someone who's read about sailing. You hire someone who's sailed in bad weather. AI is the same—papers are useful, but production scars are better. Have they debugged a hallucination spike at 2am? Handled a cost overrun? Run a post-mortem?
What to evaluate:
| Signal | Strong | Weak |
|---|---|---|
| Production experience | Shipped AI to users, ran evals, dealt with failures | Only hackathons, demos, coursework |
| System design | Can reason about RAG vs fine-tuning, model routing, failure modes | Reaches for one pattern, can't justify |
| Debugging | "When X failed, I checked Y, found Z, fixed by..." | Vague, blames the model |
| Cost awareness | Thinks in tokens, knows pricing, mentions optimization | "We'll use GPT-4 for everything" |
| Trade-offs | "I chose X over Y because..." | Can't articulate why |
Red flags: Can't explain trade-offs. Only knows one framework (LangChain or nothing). No production experience—all theory. Dismisses evaluations ("we'll test it"). Doesn't ask about scale, latency, or cost.
Interview design: System design (design an email extraction system), coding (implement a retrieval function, add a tool), behavioral (tell me about a time an AI system failed), and a "debugging" scenario (here's a trace with high latency—what do you check?).
Why This Matters: Bad hires cost 12–18 months. A senior who can't reason about production will ship fragile systems and create tech debt. Screen for the right signals.
Aha Moment: The best candidates ask you questions—about scale, failure modes, how you evaluate. That's curiosity and systems thinking. One-way interrogations miss that.
7. Managing Technical Debt in Fast-Moving AI Landscape
Interview Insight: "How do you manage technical debt when frameworks change monthly?" They want a strategy—not "we rewrite everything" or "we never upgrade." Deprecation, when to patch vs rewrite, dealing with framework fatigue.
Frameworks change monthly. Models improve quarterly. How to upgrade without breaking production. Deprecation strategies. When to rewrite vs when to patch. Managing the "latest framework" fatigue.
The house renovation analogy: You don't tear down the house because the kitchen is outdated. You renovate room by room. Same with AI systems—incremental upgrades, deprecated paths, clear migration plans. And sometimes you do tear down—when the foundation is cracked (e.g., a framework that's abandoned).
Practical tactics:
- Version pinning + upgrade windows: Pin LangChain, Azure OpenAI SDK, etc. Upgrade on a schedule—quarterly or bi-annually. Don't chase every release. Test in staging; roll out.
- Deprecation strategy: When we move from Pattern A to Pattern B, we: (1) Document the migration path. (2) Add deprecation warnings in code. (3) Set a grace period (3–6 months). (4) Remove the old path. No silent breaks.
- When to rewrite vs patch: Patch when the change is localized, risk is low, and ROI is clear. Rewrite when the abstraction is wrong (e.g., we built on a pattern that doesn't scale), maintenance cost is prohibitive, or the underlying tech is dead.
- Framework fatigue: Standardize on a small set. "We use LangGraph for agents. We don't evaluate new frameworks unless we hit a wall." Reduces decision fatigue and upgrade churn.
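The deprecation-warning step can use Python's standard `warnings` module; a sketch where the function names and the ADR number are invented for illustration:

```python
import warnings

def extract_v2(email_text: str) -> dict:
    # Placeholder for the new extraction pattern.
    return {"raw": email_text}

def extract_v1(email_text: str) -> dict:
    """Deprecated path, kept working through the grace period."""
    warnings.warn(
        "extract_v1 is deprecated; migrate to extract_v2 by end of the "
        "grace period (see the migration ADR)",
        DeprecationWarning, stacklevel=2,
    )
    return extract_v2(email_text)  # delegate so both paths stay consistent
```

Callers keep working, but every use shows up in logs and test output, which makes the eventual removal a non-event rather than a silent break.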
Why This Matters: We've had to upgrade LangChain twice in 18 months. Each time: run evals, fix breaking changes, deploy. Without a process, we'd have drifted into unmaintainable forks.
Aha Moment: Technical debt isn't always bad. Sometimes "good enough" today beats "perfect" never. The key is intentional debt—we know what we're deferring and why.
8. Communication with Non-Technical Stakeholders
Interview Insight: "How do you explain AI to executives?" Translate accuracy, hallucination rate, latency into business terms. Set realistic expectations. They want someone who can speak C-suite.
Explaining AI capabilities and limitations to business. Setting realistic expectations. Translating "accuracy" and "hallucination rate" into business terms. Executive-level reporting on AI systems.
The doctor analogy: A doctor doesn't say "your hemoglobin is 12.3 g/dL." They say "your blood looks good." Translate technical metrics into outcomes. "95% extraction accuracy" → "19 out of 20 bookings are correct without human review." "Hallucination rate 2%" → "2% of responses may need correction."
Translations:
| Technical | Business |
|---|---|
| Extraction accuracy 92% | 92 out of 100 emails auto-processed correctly; 8 need human check |
| Latency P95 1.2s | Most responses in under 1.2 seconds |
| Cost $0.05/booking | Five cents per automated booking |
| Hallucination rate 1.5% | About 1–2% of outputs may be wrong; we catch with validation |
| Human-in-the-loop 8% | 8% of cases go to human review for quality |
Setting expectations: "AI will not be perfect. We design for graceful degradation—when in doubt, human review. We improve over time with feedback." Never promise 100%. Always frame as "we're reducing manual work by X% while maintaining quality."
Executive reporting: One-pager: what we built, key metrics (throughput, accuracy, cost), risks, next steps. No jargon. "We automated 1,200 bookings/month. 92% accuracy. $X saved. Next: improve non-English handling."
Why This Matters: Executives fund projects. If they don't understand what AI can and can't do, they'll either overexpect or kill good projects. Clear communication is survival.
Aha Moment: The best communicators lead with outcomes. "We reduced manual work by 40%" before "we use RAG and LangGraph." Outcomes first, tech second.
9. Incident Ownership and Playbook Design
Interview Insight: "How do you handle AI incidents?" Model degradation, hallucination spikes, cost overruns, prompt injection. They want a playbook: detection, triage, mitigation, root cause, prevention.
AI-specific incidents: model degradation, hallucination spikes, cost overruns, prompt injection. Playbook structure: detection, triage, mitigation, root cause, prevention. On-call for AI systems. Post-mortems.
The emergency room analogy: Triage first—what's critical? Then stabilize—stop the bleeding. Then diagnose—why did it happen? Then prevent—how do we avoid it next time? Incidents follow the same arc.
AI-specific incident types:
| Incident | Detection | Triage | Mitigation |
|---|---|---|---|
| Model degradation | Accuracy drops, eval regression | Check model version, data drift | Roll back model, revert prompt, route to fallback |
| Hallucination spike | User reports, override rate up | Sample failures, check prompts | Add guardrails, constrain output, human review |
| Cost overrun | Spend > budget | Identify top consumers | Rate limit, model routing, cache |
| Prompt injection | Unexpected behavior, guardrail violations | Review logs, reproduce | Tighten input validation, block known patterns |
| Model/API outage | 5xx, timeouts | Check provider status | Circuit breaker, fallback model, cached responses |
Playbook structure:
- Detection: Alerts on latency, error rate, cost, eval regression, guardrail violations.
- Triage: Severity (P1: user impact, P2: degraded, P3: minor). Assign owner.
- Mitigation: Stop the damage. Roll back, rate limit, fallback.
- Root cause: Post-mortem. What happened? Why? Timeline.
- Prevention: What do we add? Monitoring, guardrails, tests. Blameless.
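The "circuit breaker, fallback" mitigation can be sketched in a few lines; a minimal illustrative version (not production code — real breakers also add timeouts and a half-open recovery state):

```python
class CircuitBreaker:
    """After `threshold` consecutive failures, stop calling the primary
    model and serve the fallback directly (the circuit is 'open')."""

    def __init__(self, call_primary, call_fallback, threshold=3):
        self.call_primary = call_primary
        self.call_fallback = call_fallback
        self.threshold = threshold
        self.failures = 0

    def invoke(self, prompt):
        if self.failures >= self.threshold:   # circuit open: skip the primary
            return self.call_fallback(prompt)
        try:
            result = self.call_primary(prompt)
            self.failures = 0                 # success resets the failure count
            return result
        except Exception:
            self.failures += 1
            return self.call_fallback(prompt)
```

The payoff during a model/API outage: users get degraded-but-working answers, and the dead provider stops absorbing latency on every request.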
```mermaid
flowchart TB
    subgraph Incident["Incident Response"]
        Detect[Detection / Alert]
        Triage["Triage: Severity + Owner"]
        Mitigate["Mitigation: Stop Damage"]
        RCA[Root Cause Analysis]
        Prevent["Prevention: Fix + Monitor"]
        Detect --> Triage --> Mitigate --> RCA --> Prevent
    end
```

On-call for AI: Rotate. Include runbooks for each incident type. "If accuracy drops: check eval, check model version, roll back if needed." Post-mortem within 48 hours. Document learnings.
Why This Matters: We ran a post-mortem on the email booking failures—wrong extractions, edge cases. We added evals, guardrails, monitoring. It didn't happen again. Incidents are learning opportunities.
Aha Moment: The best teams treat incidents as system improvement. Every P1 should result in a new alert or guardrail. If you're fixing the same thing twice, you didn't learn.
10. Lead-Level STAR Stories
Interview Insight: Behavioral questions for Lead roles test: direction-setting, conflict resolution, influence, mentoring, risk communication. Prepare 6–8 stories. Map each to multiple questions. Use Maersk.
Prepare these with full STAR (Situation, Task, Action, Result). Each should be 2–3 minutes. Reference Maersk AI Platform and email booking work.
Story 1: Setting Technical Direction
Question: "Tell me about a time you set technical direction for a team."
STAR: Situation: Multiple teams were building AI agents with ad-hoc setups—different LLM access, no shared guardrails, no evals. Task: Define a standard stack and platform so we could scale without chaos. Action: I proposed the Enterprise AI Agent Platform—centralized model router, guardrail engine, tool registry, evaluation service. Wrote ADRs for LangGraph, Azure OpenAI, MLflow. Piloted with email booking and one other team. Documented patterns, ran design reviews for new agents. Result: Twelve production agents now use the platform. 30% cost reduction from model routing. Zero guardrail violations. New teams onboard in days, not weeks.
Story 2: Handling Disagreements Between Senior Engineers
Question: "How do you handle disagreements between senior engineers?"
STAR: Situation: Two senior engineers disagreed on RAG vs fine-tuning for a new extraction use case. One argued fine-tuning for latency; the other RAG for flexibility. Task: Get to a decision without burning bridges. Action: I ran a structured debate: each presented their case with trade-offs. We defined success criteria—latency, accuracy, update frequency. I proposed a PoC: build both approaches on a subset, measure. Two weeks. Result: PoC showed RAG met latency targets and was easier to update. We went with RAG. The "fine-tuning" advocate agreed; we documented the decision in an ADR. Both engineers contributed to the final design.
Story 3: Build vs Buy Decision
Question: "Tell me about a time you had to make a build-vs-buy decision."
STAR: Situation: We needed a vector store for RAG. Option A: managed service (Pinecone, Azure AI Search). Option B: self-hosted (Qdrant, pgvector). Task: Choose with limited budget and one engineer. Action: I evaluated: managed = faster to ship, higher ongoing cost, vendor lock-in. Self-hosted = more work up front, lower cost, control. We had Kubernetes expertise from the Knative platform work. I proposed hybrid: managed for Phase 1 (ship in 2 weeks), evaluate self-hosted for Phase 2 if cost became an issue. Documented trade-offs, got stakeholder sign-off. Result: Shipped in 2 weeks. At 6 months, we migrated to self-hosted Qdrant—15% retrieval improvement, 40% cost reduction. The hybrid approach let us prove value before investing in build.
Story 4: Balancing Innovation with Delivery
Question: "How do you balance innovation with delivery?"
STAR: Situation: We had pressure to ship email booking automation while also exploring newer agent patterns (ReAct, multi-agent). Task: Ship production value without getting stuck in research. Action: I split the roadmap: Phase 1 = proven patterns—single extraction agent, RAG, human-in-the-loop. 6-week delivery. Phase 2 = innovation—multi-agent, advanced planning—as experiments with clear "promote to production" criteria. We didn't block Phase 1 on unproven tech. Result: Phase 1 shipped on time. Phase 2 experiments informed Phase 3; we adopted multi-agent for complex emails only where it showed clear improvement. Delivered both—reliability first, innovation in parallel.
Story 5: Mentoring Someone Who Grew Significantly
Question: "Tell me about mentoring someone who grew significantly."
STAR: Situation: A backend engineer with no LLM experience joined the team. Needed to own the email extraction agent. Task: Get them productive and confident. Action: Structured path: LLM basics (1 week), prompt engineering (1 week), pair on extraction logic (2 weeks), then RAG and evals (2 weeks). I paired on the first two features; after that, they drove with review. Created internal docs they could reference. Ran a workshop they helped present. Result: In 3 months they owned the extraction pipeline, added 3 new fields, improved eval coverage. They're now one of our AI advocates—running sessions for other teams. Mentoring scaled through docs and workshops.
Story 6: Handling Underperforming Team Members
Question: "How do you handle underperforming team members?"
STAR: Situation: A team member was missing deadlines, code quality was inconsistent, and they seemed disengaged. Task: Address performance without losing them. Action: I had a direct 1:1. Fact-based: "These three deliverables were late. Here's the impact." I asked what was going on—personal? skill gap? unclear expectations? They admitted they were overwhelmed by AI complexity. I reduced scope, paired them with a stronger engineer on the next task, and set weekly check-ins. Clear expectations: "By week 4, you'll own X. I'm here to unblock." Result: They improved. Delivered the next project on time. Still on the team. The key was diagnosing the cause—it wasn't effort, it was support. Not every underperformer needs to leave; some need scaffolding.
Story 7: Influencing Without Authority
Question: "Tell me about a time you influenced without authority."
STAR: Situation: Platform leadership was leaning toward a different evaluation framework than the one we'd proven on the email booking system. I had no formal say. Task: Get our approach adopted. Action: I didn't argue in meetings. I built a comparison: ran both frameworks on our golden dataset, documented accuracy, latency, maintenance cost. Presented a 1-pager with a recommendation. Shared with the platform lead before the decision. Offered to pilot our approach with two more teams. Result: They adopted our framework. Influence came from data and low-friction collaboration—"here's what we learned, want to try it?"—not authority.
Story 8: Communicating Technical Risk to Executives
Question: "How do you communicate technical risk to executives?"
STAR: Situation: We discovered a potential PII leakage path in an agent—the model could regurgitate training data in edge cases. Executives needed to understand without panic. Task: Communicate risk and mitigation clearly. Action: I prepared a 1-pager: What's the risk? (Low—theoretical, no observed incident.) What's the impact if it happens? (PII exposure, compliance.) What are we doing? (Input masking, output validation, audit.) Timeline? (Fix in 2 weeks.) I presented in plain language. No jargon. "We found a gap. Here's how we're closing it." Result: They approved the fix, asked for a follow-up. No escalation. Clear, calm, factual.
11. Common Lead-Level Interview Questions & Model Answers
Q1: "What's your approach to technical debt?"
Weak answer: "We try to pay it down when we can." Vague. No strategy.
Strong answer (Maersk): "We treat it as intentional. When we take on debt—e.g., using an older LangChain pattern to ship faster—we document it and set a review date. We pin versions and upgrade on a schedule so we don't drift. For the AI platform, we have a quarterly 'tech health' review: what's blocking us? What should we refactor? We've had to upgrade frameworks twice in 18 months; having a process made it manageable. The key is: debt we choose and track, not debt we ignore."
Q2: "How do you prioritize when everything is urgent?"
Weak answer: "I use a backlog and prioritize by impact." Generic. No framework.
Strong answer (Maersk): "I use impact × effort, plus a few rules. P0 incidents always win. Security and compliance get a lane. For features, I ask: user impact, revenue impact, or platform health? We triage weekly. I also push back: 'If everything is P1, nothing is.' I've had to say 'we're not doing X this quarter' so we could finish the platform guardrails. Saying no is part of prioritization. At Maersk, we shipped email booking Phase 1 by cutting scope—we deferred 'nice-to-haves' and focused on core extraction and human review. Delivered in 6 weeks."
Q3: "How do you stay current in AI when things change so fast?"
Weak answer: "I read papers and follow Twitter." Passive. No filter.
Strong answer (Maersk): "I'm selective. I follow a few key sources—Anthropic engineering, LangChain changelog, major model releases. I don't chase every framework. For the team, we have a monthly 'AI sync'—someone presents something new, we discuss relevance. If it solves a problem we have, we PoC it. If not, we note it and move on. I also learn by building—the platform forces us to touch evals, guardrails, cost tracking. That keeps skills sharp. The goal isn't to know everything; it's to know what matters for our systems."
Q4: "Describe a time you had to push back on a stakeholder."
Weak answer: "I had to tell product we couldn't do something." No context, no resolution.
Strong answer (Maersk): "Product wanted fully automated booking with zero human review—'AI should handle everything.' I pushed back: we didn't have the accuracy, and high-stakes mistakes (wrong port, wrong date) would hurt customers. I proposed a phased approach: Phase 1 with human-in-the-loop for low-confidence cases, measure accuracy, then gradually increase auto-create as we proved quality. I showed our eval results—we were at 85%, not 99%. They agreed. We shipped Phase 1, hit 92% within 3 months, and only then expanded auto-create. Pushing back wasn't 'no'—it was 'here's a safer path.'"
Q5: "How do you create a culture of ownership?"
Weak answer: "We have ownership in our values." Buzzwords.
Strong answer (Maersk): "Ownership means you run it in production. For each agent on the platform, we assign an owner—they're on-call, they own evals, they do post-mortems. We also make ownership visible: dashboards per agent, cost per team, accuracy metrics. When something breaks, we do blameless post-mortems—focus on process, not people. And we celebrate ownership: when the email booking team improved accuracy from 78% to 92%, we called it out. Ownership isn't a poster; it's structure and recognition."
Q6: "What's the biggest mistake you've made as a technical leader?"
Weak answer: "I work too hard." Deflection. Not a real mistake.
Strong answer (Maersk): "We shipped the first version of email extraction without enough eval coverage for edge cases—multi-booking emails, contradictory dates, non-English. We got production complaints, wrong extractions, and had to run a post-mortem. My mistake was prioritizing speed over robustness. We fixed it: added 50+ eval cases, tightened confidence thresholds, added guardrails for contradictions. Accuracy went from 78% to 92%. The lesson: for production AI, evals and edge cases aren't optional. I've applied that to every agent since—design reviews now always ask 'what are the failure modes?'"
Q7: "How do you handle a team member who wants to leave?"
Weak answer: "I try to understand and support them." Fine but thin.
Strong answer (Maersk): "I have a conversation. Why? Growth? Compensation? Project? Sometimes it's fixable—different project, raise, scope. Sometimes it's not. I don't take it personally. I also think about knowledge transfer: what do they own? Can we document it, pair with someone, create a handoff plan? When we had a key engineer leave, we did a 2-week overlap, ran a knowledge-sharing session, and updated runbooks. The team didn't skip a beat. And I stay in touch—good leavers become future advocates. The goal is a clean transition and a positive relationship."
Q8: "How do you measure success as a technical lead?"
Weak answer: "Team velocity and delivery." One-dimensional.
Strong answer (Maersk): "Multiple dimensions. Delivery: did we ship what we committed? Quality: are our systems reliable? Evals passing? Guardrails in place? Growth: did the team learn? Can they run without me? Sustainability: is our tech debt manageable? Are we on-call and not burning out? For the AI platform, success was: 12 agents in production, 30% cost reduction, zero guardrail violations, and 3 engineers who can now run design reviews themselves. If I'm the bottleneck, I've failed. Success is the team thriving without me in the critical path."
Conversational Interview Q&A
Q1: "How do you set technical direction for an AI team?"
Weak answer: "I pick the best tools and make sure everyone follows." Sounds autocratic. No process.
Strong answer (Maersk): "I start with constraints—what do we need to standardize for scale? At Maersk, we defined a stack: LangChain/LangGraph for agents, Azure OpenAI, MLflow for observability. I write ADRs for major decisions so we have a record of why. For new tools, we have a bar: does it solve a problem we have today? If yes, PoC it. If no, we document and defer. I also balance ship-fast vs build-well—high-impact, low-complexity features ship quickly; platform work gets more rigor. The direction isn't 'my way'—it's 'what helps the team compound.'"
Q2: "What do you look for in a design review for an AI system?"
Weak answer: "Good architecture and code quality." Too vague. Misses AI specifics.
Strong answer (Maersk): "Five things: (1) Model selection with rationale—why this model? Cost? Fallback? (2) Evaluation plan—how will we know it works? Golden dataset? (3) Cost estimate—tokens per request, monthly projection. (4) Failure modes—what if the model hallucinates? Goes down? Prompt injection? (5) Security—PII, guardrails, audit. The red flags: no eval plan, no cost estimate, no fallback. We run design reviews for every new agent on the platform. The goal is to catch issues before production—we learned that the hard way with email booking."
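The five checks in that answer can be encoded as a lightweight review template the team fills in before the meeting. Everything here—the class name, fields, and thresholds—is a hypothetical sketch to show the structure, not a real tool:

```python
from dataclasses import dataclass, field

# Hypothetical design-review template mirroring the five checks:
# model selection, eval plan, cost estimate, failure modes, security.
@dataclass
class AgentDesignReview:
    model_rationale: str = ""            # why this model? cost? fallback?
    eval_plan: str = ""                  # golden dataset, pass criteria
    monthly_cost_estimate: float = 0.0   # tokens/request x expected volume
    failure_modes: list = field(default_factory=list)  # hallucination, outage, injection
    security_notes: str = ""             # PII handling, guardrails, audit trail

    def red_flags(self) -> list:
        """The three blocking red flags: no eval plan, no cost estimate, no fallback."""
        flags = []
        if not self.eval_plan:
            flags.append("no eval plan")
        if self.monthly_cost_estimate <= 0:
            flags.append("no cost estimate")
        if "fallback" not in self.model_rationale.lower():
            flags.append("no fallback strategy")
        return flags

review = AgentDesignReview(
    model_rationale="primary model with a cheaper fallback for retries",
    eval_plan="50-case golden dataset, >=90% extraction accuracy",
)
print(review.red_flags())  # ['no cost estimate'] — the review is blocked until cost is filled in
```

A template like this turns the review from gatekeeping into a checklist: the author knows exactly what's expected before the meeting starts.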
Q3: "How do you mentor engineers new to AI?"
Weak answer: "I pair with them and answer questions." No structure.
Strong answer (Maersk): "Structured path: LLM basics and prompting, then RAG, then agents and tools, then production—evals, guardrails, observability. I pair on real tasks, not toy exercises. We created internal docs—prompt patterns, RAG best practices, eval setup—so mentoring scales. I also ran workshops as an AI Advocate: monthly office hours, quarterly deep-dives. We onboarded backend engineers who'd never touched LLMs; in 2–3 months they were shipping agents. The key is scaffolding and documentation—you can't scale mentoring 1:1."
Q4: "How do you communicate AI limitations to non-technical stakeholders?"
Weak answer: "I tell them AI isn't perfect." Unhelpful. No translation.
Strong answer (Maersk): "I translate metrics to outcomes. '95% extraction accuracy' becomes '19 out of 20 emails are correct without human review.' 'Hallucination rate 2%' becomes 'about 2% may need correction—we catch them with validation.' I'm explicit about what we defer: 'Phase 1 is English and common formats. Non-English and edge cases go to human review.' I never promise 100%. I frame as 'we're reducing manual work by X% while maintaining quality.' For executives, I use one-pagers: what we built, key metrics, risks, next steps. Outcomes first, tech second."
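That metric-to-outcome translation is mechanical enough to automate for recurring exec reports. A minimal sketch, with a hypothetical helper name and a "per 20" base chosen only because it reads naturally for humans:

```python
# Hypothetical helper: turn a raw accuracy metric into stakeholder-friendly phrasing,
# e.g. 0.95 -> "19 out of 20 emails are correct without human review".
def accuracy_to_outcome(accuracy: float, unit: str = "emails", base: int = 20) -> str:
    correct = round(accuracy * base)
    return f"{correct} out of {base} {unit} are correct without human review"

print(accuracy_to_outcome(0.95))  # 19 out of 20 emails are correct without human review
print(accuracy_to_outcome(0.92))  # 18 out of 20 emails are correct without human review
```

The design choice worth defending: "19 out of 20" lands with a non-technical audience in a way "95%" doesn't, because it forces the listener to picture the one email that fails.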
Q5: "Tell me about a time you had to say no to a technical approach."
Weak answer: "I said no to using a new framework." No context or outcome.
Strong answer (Maersk): "A team wanted to use a new agent framework that had just been released. I said no for Phase 1. Reason: we'd standardized on LangGraph for a reason—state management, checkpointing, human-in-the-loop. The new framework wasn't battle-tested. I proposed: ship Phase 1 on LangGraph, run a parallel PoC with the new framework. If it proved superior, we'd consider for Phase 2. They agreed. We shipped on time. The PoC showed the new framework wasn't ready for our use case. Saying no wasn't arbitrary—it was risk management. We documented the decision in an ADR."
Q6: "How do you handle incidents in AI systems?"
Weak answer: "We fix them and do a post-mortem." No structure.
Strong answer (Maersk): "We have a playbook. Detection: alerts on latency, error rate, cost, eval regression, guardrail violations. Triage: severity and owner. Mitigation: roll back model, route to fallback, rate limit—depends on the incident type. Root cause: blameless post-mortem within 48 hours. Prevention: what do we add? Alerts, guardrails, tests. For email booking, we had wrong extractions—we ran a post-mortem, added evals for edge cases, tightened confidence thresholds, added guardrails for contradictory dates. Accuracy went from 78% to 92%. Incidents are learning opportunities; every P1 should create a new safeguard."
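The playbook's detect → triage → mitigate flow can be sketched as a simple alert-to-mitigation table. The alert names, severities, and mitigations below are illustrative assumptions, not a real Maersk runbook:

```python
# Hypothetical triage table: alert type -> (severity, first mitigation).
# Follows the playbook order: detect -> triage -> mitigate -> root cause -> prevent.
PLAYBOOK = {
    "guardrail_violation": ("P1", "block affected agent, route to human review"),
    "eval_regression":     ("P1", "roll back to last model/prompt version"),
    "error_rate_spike":    ("P2", "route traffic to fallback model"),
    "cost_spike":          ("P2", "apply rate limit, audit top callers"),
    "latency_spike":       ("P3", "check provider status, shed non-critical load"),
}

def triage(alert: str) -> tuple:
    """Return (severity, first mitigation); unknown alerts default to paging a human."""
    return PLAYBOOK.get(alert, ("P2", "page on-call for manual triage"))

severity, action = triage("eval_regression")
print(severity, "->", action)  # P1 -> roll back to last model/prompt version
```

Even a table this small is a useful interview artifact: it shows you think of mitigation as per-incident-type, not one generic "fix it" step, and the "prevent" stage is where new rows get added after each post-mortem.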
Q7: "What's your approach to building a platform vs letting teams DIY?"
Weak answer: "We build platform for common things." Vague.
Strong answer (Maersk): "Build shared when: multiple teams need it, failure is costly, or consistency matters. At Maersk, we built the AI platform—model router, guardrails, evals, observability—because every agent needed those. Teams would have built 12 different versions badly. Let teams DIY when: one-off, experimental, or too early to standardize. They own their agents, RAG strategy, tools. We provide the grid; they plug in. The platform reduced onboarding from weeks to days. When we fix a bug, everyone benefits. Platform mindset: reduce friction for the next team."
Q8: "How do you evaluate AI engineering candidates?"
Weak answer: "I look for experience and culture fit." Generic.
Strong answer (Maersk): "Production experience over papers. Have they shipped AI to users? Debugged a hallucination spike? Run evals? I care about system design—can they reason about RAG vs fine-tuning, model routing, failure modes? Cost awareness—do they think in tokens? Red flags: can't explain trade-offs, only knows one framework, no production experience, dismisses evals. We do system design (design an extraction system), coding (implement retrieval, add a tool), and a debugging scenario (here's a trace—what do you check?). The best candidates ask us questions about scale and failure modes. That's curiosity."
From Your Experience
Prepare your own stories. Use these prompts to reflect on your Maersk lead-level work:
How did you set technical direction for the AI platform? What ADRs did you write?
Consider: LangGraph vs alternatives, Azure OpenAI choice, evaluation framework selection. What was the decision process? How did you document it?
Describe a design review you ran for an AI system. What did you catch?
Consider: Email booking, a new agent, or a platform component. What red flags did you look for? How did you make it constructive, not gatekeeping?
How did you mentor backend engineers into AI development? What was the path?
Consider: Structured learning, pair programming, docs, workshops. The AI Advocate role. Who grew significantly? What did you do differently for them?
Quick Fire Round
Q: What's an ADR?
A: Architecture Decision Record. Documents why we chose X over Y. Future reference and alignment.
Q: Three red flags in an AI design review?
A: No eval plan, no cost estimate, no fallback strategy.
Q: When to build shared platform vs let teams DIY?
A: Shared when multiple teams need it, failure is costly, or consistency matters. DIY when one-off, experimental, or too early.
Q: How do you say no to a shiny new tool?
A: Does it solve a problem we have today? If no, document and defer. If yes, PoC with success criteria.
Q: Translate "95% extraction accuracy" for executives.
A: "19 out of 20 emails are correct without human review."
Q: What's the first step in incident response?
A: Triage—severity and owner. Then mitigate to stop damage.
Q: Build vs buy—when to build?
A: When we have the expertise, long-term cost is lower, and we need control. Otherwise buy/managed for speed.
Q: How do you balance ship fast vs build well?
A: High impact + low complexity = ship fast. High impact + high complexity = scope ruthlessly, build well. Low impact = don't build unless quick.
Q: What to look for in AI engineering candidates?
A: Production experience, system design ability, cost awareness, can explain trade-offs. Red flags: only one framework, no production, dismisses evals.
Q: How do you handle framework churn?
A: Standardize on a small set. Pin versions. Upgrade on a schedule. Don't chase every release.
Q: What's the platform mindset?
A: "How do I make the next team's job easier?" Reduce friction. Compound returns.
Q: How do you scale mentoring?
A: Documentation, workshops, pair on real tasks. Document once, scale to many.
Q: When to rewrite vs patch?
A: Patch when localized, low risk. Rewrite when abstraction is wrong, maintenance cost prohibitive, or tech is dead.
Q: What's a blameless post-mortem?
A: Focus on process and systems, not people. Goal: learn and prevent, not blame.
Q: How do you influence without authority?
A: Data, PoCs, low-friction collaboration. "Here's what we learned—want to try it?" Build consensus.
Key Takeaways
| Topic | Key Point |
|---|---|
| Technical direction | Define stack, ADR decisions, balance ship-fast vs build-well, say no to shiny tools with a bar. |
| Design reviews | Evaluate: model selection, eval plan, cost, failure modes, security. Red flags: no eval, no cost, no fallback. Constructive, not gatekeeping. |
| Platform | Build shared when multiple teams need it, failure costly, consistency matters. Platform mindset: reduce friction for others. |
| Mentoring | Structured path, pair on real tasks, document for scale, workshops. AI Advocate = content + knowledge-sharing. |
| Cross-functional | Translate: product (features), data science (metrics), security (risk), infra (capacity). Set expectations on AI limitations. |
| Hiring | Production > papers. System design, debugging, cost awareness. Red flags: no trade-offs, one framework, no production. |
| Technical debt | Version pinning, upgrade schedule, deprecation strategy. Patch vs rewrite: localized = patch; wrong abstraction = rewrite. |
| Stakeholder comms | Translate metrics to outcomes. Never promise 100%. One-pagers for execs. Outcomes first, tech second. |
| Incidents | Playbook: detect, triage, mitigate, root cause, prevent. Blameless post-mortems. Every P1 → new safeguard. |
| STAR stories | 8 stories. Map to: direction, conflict, build/buy, innovation, mentoring, underperformance, influence, risk. Use Maersk. |
Further Reading
- StaffEng: Stories of Staff Engineers — Real stories from staff+ engineers on influence, technical direction, and leadership. Practical and varied.
- An Elegant Puzzle: Systems of Engineering Management (Will Larson) — Prioritization, technical debt, hiring, organizational design. Highly recommended for leads.
- The Manager's Path (Camille Fournier) — Progression from senior to lead/manager. Relevant for lead engineers.
- Michael Nygard: Documenting Architecture Decisions — The original write-up of the ADR format and best practices. Lightweight but effective.
- Google: Postmortem Culture — Blameless post-mortems, learning from incidents. From the SRE book.
- Charity Majors: Observability and the Unknown Unknowns — Why observability matters for production systems. Applies to AI.
End of Session 19. This completes the Senior AI Engineer interview preparation guide. You've got the technical depth and the leadership playbook. The jump to Lead isn't a leap—it's the next step. Go show them you can make everyone else's code better.