Session 15: Production Deployment & Infrastructure for AI
Your agent works perfectly in Jupyter. You've tweaked the prompts, the retrieval is solid, and the demo wowed everyone. Now try running it in production—where the OpenAI API goes down at 3 AM, your container runs out of memory, and 500 users hit it simultaneously. Welcome to the fun part.
Production deployment for AI is where the rubber meets the road. The same patterns that made you successful with Kubernetes and Knative—containerization, autoscaling, CI/CD—still apply. But AI adds new failure modes: model APIs that flake, token costs that explode, quality that drifts. And new operational demands: evals in CI, GPU scheduling, fallback chains. This session gets you from "works on my machine" to "works at 2 AM when nobody's watching."
We're not going to hand you a checklist. We're going to give you opinions, war stories, and the kind of answers that make interviewers sit up. You've built a Knative platform. You've shipped an email booking agent at Maersk. This is your playbook.
The difference between a good AI engineer and a senior one? The senior one has been paged at 3 AM. They've seen cost projections miss by 4×. They know why readiness probes exist. If that sounds like you—or where you're headed—read on.
1. Containerizing AI Applications
Interview Insight: They want to see you think about reproducibility, image size, and the model problem. "Docker? I just ran pip install and it worked" is not an answer.
Think of your Docker image like a shipping container for a factory. You don't ship the entire raw materials warehouse—you ship exactly what's needed to produce the goods. Same idea: your 2GB image with every dev dependency and a 70B model baked in is not a shipping container. It's a liability. Multi-stage builds are your best friend: stage one installs build deps and pip packages, stage two copies only the runtime artifacts into a minimal base. A single-stage image can balloon to 1GB+; a well-crafted multi-stage build gets you down to 300–500MB. Faster pulls, faster cold starts, lower storage costs.
Pin every dependency. openai==1.12.0, not openai>=1.0. Reproducible builds prevent "works on my machine" disasters when a transitive dependency updates and breaks everything. Use pip freeze after validating in dev. Add --no-cache-dir to pip install. Use .dockerignore to exclude .git, __pycache__, .env, tests, and large files—you'd be surprised how often secrets or test data slip in. A forgotten .env with OPENAI_API_KEY=sk-... in your image? Congratulations, you just leaked a key to anyone who can pull the image. It happens.
Managing large model files: Don't bake models into the image. A 7B model at FP16 is ~14GB. A 70B model? Forget it. It's like putting the warehouse inside the delivery truck. Pull models at startup from a registry (Hugging Face, S3) or mount them from shared storage. Use ConfigMaps or env vars for model paths. For GPU images, use a CUDA runtime base image (e.g., nvidia/cuda:12.1.1-runtime-ubuntu22.04). The nvidia-container-toolkit on the host exposes GPUs; in Kubernetes you request nvidia.com/gpu: 1 in resource limits.
Why This Matters in Production: Bloated images mean slow deployments and unreliable rollbacks. Unpinned deps mean "it worked yesterday" incidents. Models in images mean every model update requires a full rebuild.
Aha Moment: Layer ordering matters for cache efficiency. Copy requirements.txt and run pip install before copying application code. Code changes won't bust your dependency layer—only dependency changes trigger a full pip reinstall.
2. API Design for AI Services (FastAPI)
Interview Insight: They're probing whether you understand async vs sync, streaming, health checks, and production API hygiene. "I use Flask" isn't wrong, but FastAPI's async-native design and built-in validation matter for AI.
FastAPI is the default choice for Python AI APIs—async-native (critical when you're waiting on LLM APIs), automatic OpenAPI docs, Pydantic validation. Your LLM calls and database queries are I/O-bound; async lets you handle many concurrent requests without blocking threads. Define ChatRequest and ChatResponse with Pydantic; invalid requests get 422 with detailed errors. The schema documents itself.
Streaming is non-negotiable for chat. Use Server-Sent Events with StreamingResponse and Content-Type: text/event-stream. Yield chunks as data: {"content": "chunk"}\n\n. Users see tokens as they're generated—perceived latency drops. Handle client disconnects: check request.is_disconnected() in your generator loop so you don't waste tokens on gone users.
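The disconnect check above can be sketched framework-agnostically. `sse_stream` and its arguments are illustrative names: in FastAPI you would pass `request.is_disconnected` as the callback and wrap the generator in a `StreamingResponse`.

```python
import asyncio
import json
from typing import AsyncIterator, Awaitable, Callable

async def sse_stream(
    chunks: AsyncIterator[str],
    is_disconnected: Callable[[], Awaitable[bool]],
) -> AsyncIterator[str]:
    """Frame chunks as SSE events, stopping early when the client goes away.

    With FastAPI, pass `request.is_disconnected` and wrap the result in
    StreamingResponse(media_type="text/event-stream").
    """
    async for content in chunks:
        if await is_disconnected():
            break  # stop paying for tokens nobody reads
        yield f"data: {json.dumps({'content': content})}\n\n"
```

The early `break` is the whole point: without it, the generator keeps pulling (and paying for) LLM tokens after the user has closed the tab.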
Health vs readiness: /health = process alive (liveness probe). /ready = can we actually serve requests? Is the LLM API reachable? Is the vector DB connected? Return 503 if not. Kubernetes uses liveness to restart unhealthy pods and readiness to yank them from service. For AI, readiness must verify you can actually serve—not just that the process started.
Why This Matters in Production: Sync endpoints block; async handles concurrency. Missing readiness checks mean traffic hits pods that can't reach the LLM. No streaming means users stare at a spinner for 10 seconds.
Aha Moment: Use sync for CPU-bound work (local inference, heavy compute) or blocking libraries. Use run_in_executor if you need both in one endpoint.
3. Kubernetes for AI Workloads
Interview Insight: They want you to talk about resource requests/limits, GPU scheduling, HPA limitations, and when KEDA makes sense. Generic K8s answers don't cut it—AI workloads have specific needs.
You've run Kubernetes. You've probably built a Knative platform. The core concepts are the same—Deployments, rolling updates, Services. The difference: AI workloads are I/O-bound (waiting on LLM APIs), often GPU-hungry, and can scale to zero if you're clever. Set CPU and memory requests and limits. For LLM inference, memory must fit the model—7B needs ~14GB+ for weights and activations. Add headroom or the OOM killer visits.
HPA on CPU fails for LLM workloads. Your CPU sits idle while waiting for the API. Scale on custom metrics instead: request queue length, pending requests, latency. Expose them via Prometheus. KEDA scales on queue depth, HTTP rate, or Prometheus metrics—when requests pile up, scale out; when the queue empties, scale in. KEDA also does scale-to-zero. The trade-off: cold start. Loading a 7B model takes 10–30 seconds. Keep minReplicas if latency matters.
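A rough illustration of the "pending requests" metric you would expose for KEDA or a custom-metrics HPA. This is a hand-rolled stand-in for a Prometheus gauge so the mechanics are visible without extra dependencies; all names are assumptions.

```python
import asyncio
from contextlib import asynccontextmanager

class PendingRequests:
    """In-flight request counter -- the custom metric you'd export for scaling."""

    def __init__(self) -> None:
        self.value = 0

    @asynccontextmanager
    async def track(self):
        self.value += 1      # request entered the queue
        try:
            yield
        finally:
            self.value -= 1  # request left (success or failure)

pending = PendingRequests()

async def handle(request_id: int) -> str:
    async with pending.track():
        await asyncio.sleep(0)  # stand-in for the real LLM call
        return f"done-{request_id}"
```

In a real service you would export `pending.value` via a `/metrics` endpoint (e.g., with prometheus_client) and point a KEDA Prometheus scaler or custom-metrics HPA at it.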
Use node selectors or affinity so GPU workloads land on GPU nodes. ConfigMaps for prompts, model names, feature flags. Secrets for API keys—never in Git. External Secrets operator syncs from Azure Key Vault, AWS Secrets Manager, or Vault.
Real-world analogy: Picture a restaurant kitchen. Liveness is "is the chef breathing?" Readiness is "does the chef have ingredients, a working stove, and a clean grill?" You wouldn't seat customers if the stove is broken—K8s shouldn't send traffic to pods that can't reach their dependencies.
flowchart TB
subgraph clientLayer["Client Layer"]
webApp[Web App]
apiClient[API Client]
end
subgraph gateway["API Gateway"]
loadBalancer[Load Balancer]
rateLimiter[Rate Limiter]
end
subgraph fastApiSvc["FastAPI Service"]
api[FastAPI]
retriever[Retriever]
llmOrch[LLM Orchestrator]
tools[Tool Executor]
end
subgraph external["External Services"]
vectorDb[(Vector DB)]
openAi[OpenAI API]
anthropic[Anthropic API]
end
webApp --> loadBalancer
apiClient --> loadBalancer
loadBalancer --> rateLimiter
rateLimiter --> api
api --> retriever
api --> llmOrch
api --> tools
retriever --> vectorDb
llmOrch --> openAi
llmOrch --> anthropic
tools --> vectorDb

Why This Matters in Production: Wrong resource limits = OOM kills and scheduling failures. CPU-based HPA = under-scaling. No GPU affinity = GPU pods on CPU nodes.
Aha Moment: Readiness probes that only check process health are useless for AI. Your pod is "ready" when it can actually call the LLM and retrieve from the vector store.
4. GPU Sizing and Selection
Interview Insight: They want practical rules of thumb, not theory. "How do you know if a model fits?" and "What about quantization?" are fair game.
Simple rule: FP16 model needs ~2× parameters in GB of VRAM (7B ≈ 14GB). Quantization cuts that: INT4 (Q4) needs ~0.5× (7B ≈ 3.5–4GB), INT8 ~1×. Think of it like packing a suitcase—FP16 is everything unfolded, Q4 is rolling your clothes. Same stuff, less space, slight trade-off in quality. Context length adds memory—8K context adds ~1GB per 10B params; 128K can double VRAM. The KV cache eats dynamic memory during generation.
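The sizing rule can be captured in a small helper. The 20% overhead default is an assumption; real activation and KV-cache memory vary heavily with batch size and context length.

```python
def estimate_vram_gb(params_billions: float, bits_per_weight: int = 16,
                     overhead_fraction: float = 0.2) -> float:
    """Rough VRAM estimate: weight bytes plus headroom.

    weights_gb = params * (bits / 8); overhead covers activations and the
    KV cache (the 20% default is an assumption, not a universal constant).
    """
    weight_gb = params_billions * (bits_per_weight / 8)
    return weight_gb * (1 + overhead_fraction)
```

Sanity checks against the rules of thumb: 7B at FP16 is 14GB of weights (so a 16GB T4 is already tight once overhead is added); 7B at Q4 is 3.5GB.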
Small (7–8B): 16GB GPUs (T4, A10G) at FP16, or 4–6GB at Q4. Good for simple chat, classification. Medium (13–34B): 24–40GB (A10G, A100-40GB) at FP16. Complex reasoning, RAG. Large (70B+): 80GB+ (A100-80GB, H100) or multi-GPU. Cost: T4 cheapest, A10G balanced, A100 best throughput, H100 latest and priciest. Spot instances can cut cost 60–70% with interruption risk.
Batch size: larger = higher throughput, more memory. For interactive latency, use batch 1. vLLM's continuous batching improves GPU utilization by dynamically batching requests with different sequence lengths.
Why This Matters in Production: Undersized GPU = OOM or no-load. Oversized = money burning. Wrong quantization = quality vs cost trade-off you didn't plan for.
Aha Moment: "Will it fit?" is the first question. FP16 params × 2 in GB. Then add headroom for activations and KV cache.
5. CI/CD for AI Applications
Interview Insight: This is where you differentiate. Standard CI/CD isn't enough—you need evals and quality gates. "We run pytest" is weak. "We run evals and block merges that regress faithfulness" is strong.
Standard CI—lint, test, build, deploy—must be extended. Code changes can break prompt behavior. A "harmless" refactor can subtly change how your system prompt gets injected. A dependency update can shift tokenization. You need to catch that before it hits prod. Model or prompt updates can degrade quality. Your CI must validate quality, not just functionality. Run an eval suite on every PR: golden dataset, input-output pairs or inputs with expected properties. Compute faithfulness, relevancy, correctness. If quality drops below threshold, block the merge. Tools: Promptfoo, Braintrust, LangSmith.
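A minimal sketch of the quality gate: CI computes eval scores, compares them to thresholds, and fails the build on regression. `quality_gate` and the metric names are illustrative, not any particular tool's API.

```python
def quality_gate(scores: dict[str, float],
                 thresholds: dict[str, float]) -> tuple[bool, list[str]]:
    """Compare eval scores against minimum thresholds.

    Returns (passed, failure_messages); the CI job exits nonzero when
    passed is False, which blocks the merge.
    """
    failures = [
        f"{metric}: {scores.get(metric, 0.0):.2f} < {minimum:.2f}"
        for metric, minimum in thresholds.items()
        if scores.get(metric, 0.0) < minimum
    ]
    return (not failures, failures)
```

In a GitHub Actions step this would run right after the eval suite, print the failures, and `sys.exit(1)` so the merge is blocked.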
Canary deployments: route 5–10% of traffic to the new version. Monitor quality and latency. Gradually increase to 100% or roll back. Feature flags let you toggle prompts or models without redeploying. Rollback must be automated—if quality drops in prod, trigger rollback. Manual rollback is too slow.
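Sticky canary routing can be as simple as hashing a stable user ID into buckets. A sketch, not a substitute for a gateway or mesh's weighted routing; names and the 5% default are illustrative.

```python
import zlib

def pick_version(user_id: str, canary_percent: int = 5) -> str:
    """Route ~canary_percent% of users to the canary, deterministically.

    Hashing a stable ID means the same user always gets the same version,
    so sessions stay consistent while you ramp traffic.
    """
    bucket = zlib.crc32(user_id.encode()) % 100
    return "canary" if bucket < canary_percent else "stable"
```

The deterministic bucket is what makes ramping safe: raising `canary_percent` from 5 to 25 only moves new users onto the canary, never bounces existing ones back and forth.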
flowchart LR
subgraph code["Code"]
pr[Pull Request]
merge[Merge]
end
subgraph pipeline["CI Pipeline"]
lint[Lint]
test[Unit Tests]
eval[Eval Suite]
build[Build Image]
deploy[Deploy]
end
subgraph gate["Quality Gate"]
qualityCheck{Quality >= Threshold?}
end
pr --> lint
lint --> test
test --> eval
eval --> qualityCheck
qualityCheck -->|Pass| build
qualityCheck -->|Fail| block[Block Merge]
build --> deploy
deploy --> merge

Why This Matters in Production: A "harmless" prompt tweak can tank faithfulness. Deploying without evals is playing Russian roulette with user trust.
Aha Moment: Treat your golden eval dataset as a first-class artifact. Version it. When prompts change, re-run. It's your regression test for AI.
6. Reliability Patterns
Interview Insight: LLM APIs fail in weird ways. They want to hear retries, fallbacks, circuit breakers, and idempotency. "We just retry" is naive.
LLM APIs rate-limit (429), timeout (504), go down for hours. Retries alone can create retry storms—everyone hammering a failing API. Circuit breaker: after N consecutive failures, stop calling. "Open" the circuit, fail fast. After cooldown, allow one test request (half-open). Success = close; fail = stay open. Prevents wasted tokens and blocked threads.
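A minimal version of that breaker, with the failure threshold and cooldown as assumed defaults and an injectable clock for testability:

```python
import time

class CircuitBreaker:
    """Closed -> open -> half-open breaker; defaults here are assumptions."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock      # injectable for testing
        self.failures = 0
        self.opened_at = None   # timestamp when the circuit tripped

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cooldown_s:
            return "half-open"  # cooldown elapsed: allow one probe request
        return "open"

    def allow_request(self) -> bool:
        return self.state != "open"  # open = fail fast, no API call

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None        # close the circuit

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # trip: start the cooldown
```

Callers check `allow_request()` before each LLM call and report the outcome with `record_success()` / `record_failure()`; libraries like pybreaker package the same state machine if you'd rather not own it.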
Retry with exponential backoff: 1s, 2s, 4s, 8s (cap ~60s). Add jitter. Retry only on 429, 502, 503, 504—never 400, 401, 404. Respect Retry-After headers. Fallback chain: primary (GPT-4o) → Claude → Gemini → local. When primary fails, try next. Quality may drop, but users get a response. Circuit breakers help: if GPT-4o has been failing for 5 minutes, skip it.
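The fallback chain itself is just a loop over providers in priority order. This sketch assumes each provider is wrapped in a callable that raises on failure (after its own retries and circuit-breaker check); the provider names are illustrative.

```python
def call_with_fallback(providers, messages):
    """Try each (name, call) pair in priority order; return the first success.

    Each `call` wraps one model API and raises on failure. If every
    provider fails, raise with the full error trail for debugging.
    """
    errors = []
    for name, call in providers:
        try:
            return name, call(messages)
        except Exception as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Returning the provider name alongside the answer matters in practice: you want metrics on how often you're serving degraded responses from the fallback.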
Idempotency for agents that do things: booking creation, email sends, DB writes. Use idempotency keys—client sends unique key per logical op; server deduplicates. Retries after timeout won't double-book. Dead letter queues for failed async jobs—inspect, fix, replay. Timeouts: 30s typical for LLM calls. Graceful degradation: if retrieval is down, answer with disclaimer.
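Idempotency-key deduplication in miniature. An in-memory sketch: production would back this with Redis or a database row (with a TTL), and store the result atomically with the action.

```python
class IdempotencyStore:
    """Deduplicate state-changing actions by client-supplied idempotency key."""

    def __init__(self) -> None:
        self._results: dict[str, object] = {}

    def execute(self, key: str, action):
        if key in self._results:
            return self._results[key]  # a retry: cached result, no new side effect
        result = action()
        self._results[key] = result
        return result
```

The client sends the same key when it retries after a timeout, so "create booking" runs once no matter how many times the request arrives.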
stateDiagram-v2
[*] --> closed
closed --> open: Failure count >= threshold
open --> halfOpen: Timeout elapsed
halfOpen --> closed: Test request succeeds
halfOpen --> open: Test request fails

Why This Matters in Production: Without circuit breakers, one API outage cascades into timeouts across all requests. Without idempotency, retries create duplicate bookings. Without fallbacks, you're down when OpenAI is.
Aha Moment: Circuit breaker + fallback chain = you stay up when any single provider fails. Design for partial availability.
flowchart LR
subgraph request["Incoming Request"]
req[Request]
end
subgraph fallbackChain["Fallback Chain"]
primary[Primary: GPT-4o]
secondary[Claude]
tertiary[Gemini]
local[Local Model]
end
req --> primary
primary -->|Fail| secondary
secondary -->|Fail| tertiary
tertiary -->|Fail| local
primary -->|Success| resp[Response]
secondary -->|Success| resp
tertiary -->|Success| resp
local -->|Success| resp

7. Infrastructure as Code for AI
Interview Insight: Terraform, Helm, secrets management. They want to know you don't hand-roll YAML or store API keys in env files.
Terraform provisions cloud resources: Azure OpenAI endpoints, vector DBs, Kubernetes clusters, storage. Define in HCL, apply, version in Git. Helm charts templatize K8s manifests—Deployment, Service, ConfigMap, HPA, Ingress. Parameterize via values.yaml. Deploy with helm upgrade --install. Dev gets smaller models, relaxed limits; prod gets full models, strict limits.
Secrets: Never in code. Never in plain env vars in Git. Azure Key Vault, AWS Secrets Manager, Vault. External Secrets operator syncs vault contents into Kubernetes Secrets—the app never sees the raw vault, just a mounted Secret. Rotate regularly. Use short-lived creds where possible. Separate keys for dev/staging/prod. At Maersk we had separate OpenAI keys per environment so a prod incident didn't burn through our dev quota.
Why This Matters in Production: Manual infra = drift, inconsistency, "it works on staging." Secrets in Git = breach.
Aha Moment: IaC isn't optional for production. You need repeatable, auditable deployment. Helm + Terraform + External Secrets = professional setup.
8. Deployment Architectures
Interview Insight: Monolith vs microservices—when do you split? "Microservices for everything" is cargo cult. "Start monolithic, split when it hurts" is pragmatic.
Monolith: API, retrieval, generation, tools in one service. Simple. Good for small teams, early stage. Can't scale retrieval and generation independently. One bug takes down everything. Microservices: Split retrieval, generation, tool execution. Scale each independently. Retrieval on CPU nodes, generation on GPU nodes. Failure isolation. More to deploy, monitor, debug.
Serverless: Good for webhooks, batch, event-driven. Cold start kills self-hosted inference—loading a model in Lambda can exceed timeouts. Use for lightweight tasks (embeddings, classification) or API-calling workloads. When to choose: Start monolithic. Split when scaling needs diverge, team size justifies ownership, or failure isolation matters.
flowchart TB
subgraph monolith["Monolithic"]
mono[Single Service]
monoRetriever[Retriever]
monoLlm[LLM]
monoTools[Tools]
mono --> monoRetriever
mono --> monoLlm
mono --> monoTools
end
subgraph microservices["Microservices"]
apiSvc[API Service]
retrieverSvc[Retriever Service]
llmSvc[LLM Service]
toolsSvc[Tools Service]
apiSvc --> retrieverSvc
apiSvc --> llmSvc
apiSvc --> toolsSvc
endWhy This Matters in Production: Over-architecting early wastes time. Under-architecting at scale creates bottlenecks. Match complexity to need.
Aha Moment: The split point is usually "we need to scale X but not Y" or "team A owns retrieval, team B owns generation."
9. Monitoring: Before Go-Live Checklist
Interview Insight: Standard metrics aren't enough. They want AI-specific metrics—quality, tokens, cost. The $340K vs $80K projected cost story from 2025 is real.
Application: Request rate, error rate, P50/P95/P99 latency. AI-specific: Faithfulness, relevancy (from evals), guardrail trigger rate, token usage per request, cost per request. Cost: Daily spend, cost per request, token usage by model. Budgets and alerts. One startup hit $340K actual vs $80K projected—cost monitoring is critical.
Alerting: Error rate > 5%, P95 > SLA, faithfulness < 0.8, cost anomaly > 120% of daily avg. Tune to avoid fatigue. Logging: Structured JSON, request IDs. No PII. Sampling for high volume. Load test before launch—k6, Locust. Simulate 2–3× peak traffic. Find bottlenecks. We've seen teams skip this and discover their "simple" agent hits 10s P99 under load because every request was doing full RAG retrieval with no caching. Fix it before users complain.
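The cost-anomaly alert above reduces to a one-line comparison; the 120% threshold matches the alerting rule, and the function shape is an assumption.

```python
def cost_anomaly(today_spend: float, trailing_daily: list[float],
                 threshold: float = 1.2) -> bool:
    """Alert when today's spend exceeds threshold x the trailing daily average."""
    if not trailing_daily:
        return False  # no baseline yet: don't alert on day one
    average = sum(trailing_daily) / len(trailing_daily)
    return today_spend > threshold * average
```

Run it from a scheduled job over your billing export; wire the `True` branch to your pager, not just a dashboard.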
Why This Matters in Production: You can't fix what you can't see. Cost blindness kills projects. Quality drift without evals is silent failure.
Aha Moment: Add quality and cost to your dashboard on day one. They're not "nice to have"—they're how you know if the system works.
Code Examples
Multi-Stage Dockerfile for FastAPI AI Service
# Stage 1: Builder
FROM python:3.12-slim AS builder
WORKDIR /app
RUN apt-get update && apt-get install -y --no-install-recommends build-essential \
&& rm -rf /var/lib/apt/lists/*
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
COPY requirements.txt .
RUN pip install --no-cache-dir --upgrade pip && pip install --no-cache-dir -r requirements.txt
# Stage 2: Runtime
FROM python:3.12-slim AS runtime
WORKDIR /app
COPY --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
RUN useradd -m -u 1000 appuser && chown -R appuser:appuser /app
USER appuser
COPY --chown=appuser:appuser . .
ENV MODEL_PATH=/models
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

FastAPI with Streaming, Health, and Schemas
from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel, Field
import httpx
import json

app = FastAPI(title="AI Chat API")

class ChatMessage(BaseModel):
    role: str = Field(..., pattern="^(system|user|assistant)$")
    content: str

class ChatRequest(BaseModel):
    messages: list[ChatMessage]
    model: str = "gpt-4o-mini"
    max_tokens: int = Field(default=1024, ge=1, le=4096)

@app.get("/health")
def health():
    return {"status": "ok"}

@app.get("/ready")
async def ready():
    try:
        async with httpx.AsyncClient() as client:
            r = await client.get("https://api.openai.com/v1/models", timeout=5.0)
    except Exception as e:
        raise HTTPException(status_code=503, detail=f"LLM API unreachable: {e}")
    if r.status_code != 200:
        raise HTTPException(status_code=503, detail=f"LLM API returned {r.status_code}")
    return {"status": "ready"}

@app.post("/chat/stream")
async def chat_stream(request: ChatRequest):
    async def generate():
        async with httpx.AsyncClient() as client:
            async with client.stream(
                "POST", "https://api.openai.com/v1/chat/completions",
                json={"model": request.model,
                      "messages": [m.model_dump() for m in request.messages],
                      "max_tokens": request.max_tokens, "stream": True},
                timeout=60.0,
            ) as response:
                async for line in response.aiter_lines():
                    if line.startswith("data: "):
                        data = line[6:]
                        if data == "[DONE]":
                            break
                        try:
                            chunk = json.loads(data)
                            content = chunk.get("choices", [{}])[0].get("delta", {}).get("content", "")
                            if content:
                                yield f"data: {json.dumps({'content': content})}\n\n"
                        except json.JSONDecodeError:
                            pass
    return StreamingResponse(
        generate(),
        media_type="text/event-stream",
        headers={"Cache-Control": "no-cache", "Connection": "keep-alive"},
    )

Retry with Exponential Backoff (Tenacity)
import asyncio
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception
import httpx

def is_retryable(e):
    if isinstance(e, httpx.HTTPStatusError):
        return e.response.status_code in (429, 502, 503, 504)
    return isinstance(e, (httpx.TimeoutException, httpx.ConnectError))

@retry(
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=1, max=60),
    retry=retry_if_exception(is_retryable),
    reraise=True,
)
async def call_llm_with_retry(messages: list, model: str = "gpt-4o") -> str:
    async with httpx.AsyncClient() as client:
        r = await client.post(
            "https://api.openai.com/v1/chat/completions",
            json={"model": model, "messages": messages},
            timeout=30.0,
        )
        if r.status_code == 429 and "Retry-After" in r.headers:
            # Honor the server's hint before tenacity's own backoff kicks in
            await asyncio.sleep(int(r.headers.get("Retry-After", 5)))
        r.raise_for_status()
        return r.json()["choices"][0]["message"]["content"]

Conversational Interview Q&A
Q1: "How do you deploy an AI agent to production? Walk through your CI/CD pipeline."
Weak answer: "We use Docker and Kubernetes. CI runs tests and builds the image, then we deploy to our cluster."
Strong answer: "At Maersk, we run a full pipeline: lint, unit tests, and an eval suite on every PR. The eval suite has golden scenarios—booking requests, cancellations, ambiguous queries. We measure faithfulness and relevancy. If either drops below threshold, we block the merge. That's the key differentiator: we don't ship prompt or model changes without quality validation. Build is multi-stage Docker, pinned deps, no models baked in. We deploy with Helm, rolling updates, and readiness probes that verify LLM API connectivity. For riskier changes, we've used canary deployments—5% traffic first, monitor, then ramp. Automated rollback if quality or error rate spikes in prod."
Q2: "How do you handle LLM API outages?"
Weak answer: "We retry a few times. If it keeps failing, we return an error."
Strong answer: "Retries with exponential backoff for 429, 502, 503, 504—with jitter, respecting Retry-After. But retries aren't enough. We use a fallback chain: primary model, then secondary provider. For the email booking agent at Maersk, if our primary is down or rate-limited, we fail over. Circuit breaker prevents us from hammering a failing API—after N failures we open the circuit, fail fast, try again after cooldown. We also set hard timeouts (30s) so hung requests don't block capacity. The goal: partial availability. Users get a response even when one provider is having a bad day."
Q3: "What's the difference between /health and /ready? Why does it matter?"
Weak answer: "Health checks if the app is up. Readiness checks if it can serve traffic."
Strong answer: "Liveness (/health) tells Kubernetes the process is alive—if it fails, K8s restarts the pod. Readiness (/ready) tells K8s we can actually serve requests. For AI, that means: can we reach the LLM API? Is the vector DB connected? Is the model loaded? If readiness fails, we get yanked from the load balancer—no traffic until we're actually ready. At Maersk, our booking agent's readiness checks LLM and email service connectivity. Without that, we'd get traffic during partial outages and return 503s to users. Readiness is your 'don't send me work I can't do' signal."
Q4: "How do you scale an AI service? CPU-based HPA?"
Weak answer: "We use HPA on CPU. When CPU goes up, we add replicas."
Strong answer: "CPU-based HPA fails for LLM workloads—we're I/O-bound waiting on the API, so CPU stays low. We scale on custom metrics: queue depth, pending requests, or latency. At Maersk we've used KEDA for queue-based scaling—when requests pile up, scale out; when the queue empties, scale in. For HTTP services, we expose a pending-requests metric. We also set minReplicas to avoid cold starts for latency-sensitive chat. GPU workloads get node affinity so they land on GPU nodes. Building a Knative platform taught me that scale-to-zero is great for batch, but chat needs warm pods."
Q5: "How do you manage API keys and secrets for LLM providers?"
Weak answer: "We use environment variables. They're set in our deployment config."
Strong answer: "Never in code, never in Git. We use Azure Key Vault (or AWS Secrets Manager in AWS). External Secrets operator syncs vault contents into Kubernetes Secrets. The app gets keys via env or mounted files. We use separate keys for dev, staging, prod. Rotate quarterly—we have a process to update the vault and redeploy. At Maersk, we ran secret scanning in CI to catch accidental commits. Short-lived credentials where the provider supports it. The rule: if it's in a repo, assume it's leaked."
Q6: "When do you choose microservices over a monolith for AI?"
Weak answer: "Microservices are better for scale. We always use them."
Strong answer: "Start monolithic. Split when it hurts. At Maersk, our email booking agent started as a single FastAPI service—API, retrieval, generation, tools. Simple to develop and deploy. We'd split if (a) retrieval and generation have different scaling needs—e.g., retrieval is CPU-bound, generation needs GPUs, (b) team structure means different owners for different parts, or (c) we need failure isolation—retrieval down shouldn't take down the whole agent. The Knative platform I built was event-driven and service-based, but that was for broader platform needs. For a single AI agent, monolith first. Complexity has a cost."
Q7: "What monitoring do you set up before going live?"
Weak answer: "We have Prometheus and Grafana. We track request count and errors."
Strong answer: "Standard metrics: request rate, error rate, P50/P95/P99. But for AI we add quality and cost. We track faithfulness and relevancy—either real-time lightweight evals or sampled batches. Guardrail trigger rate. Token usage per request, cost per request. At Maersk we had dashboards for daily spend and cost anomalies. One alert: if today's spend is 2× the 7-day average. We set thresholds for error rate > 5%, P95 > SLA, faithfulness < 0.8. Structured logs with request IDs for tracing. Load test before launch—2–3× peak traffic. The $340K vs $80K story is a caution: cost blindness kills."
From Your Experience
Prompt 1 – Deploying the Maersk email booking agent: "At Maersk we containerized the email booking agent as a FastAPI service deployed to Kubernetes. CI ran lint, unit tests, and evals on golden email scenarios—booking requests, cancellations, ambiguity. We used a multi-stage Dockerfile, pinned deps, no models in the image. API keys came from Key Vault via External Secrets. Readiness probe checked LLM and email service connectivity. Rolling deploys, 2 replicas. Retries and fallback chain for LLM outages. Idempotency keys for booking creation so retries don't double-book. After a provider outage caused retry storms, we added a circuit breaker."
Prompt 2 – Kubernetes/Knative experience applied to AI: "Building a Knative platform gave me scale-to-zero and request-driven scaling. For AI, KEDA provides similar event-driven autoscaling on queue depth or custom metrics. ConfigMaps, Secrets, resource limits—same patterns. The new dimension is GPU scheduling: node affinity, GPU memory sizing, ensuring AI pods land on the right nodes. Observability extends—traces, metrics, logs—but AI adds quality and cost as first-class metrics. Canary, rollback, incident response: same discipline. The eval gate in CI is the main addition for AI."
Prompt 3 – Enterprise AI Agent Platform reliability: "On the Enterprise AI Agent Platform at Maersk we implemented retries with exponential backoff (tenacity, 3–5 attempts), fallback chains across providers, circuit breakers after we saw retry storms, and 30s timeouts on LLM calls. Idempotency for any agent action that modifies state. Dead letter queues for failed async jobs. Graceful degradation: if retrieval is down, we return a disclaimer with general knowledge. The goal: stay up when components fail, degrade gracefully, never lose data."
Quick Fire Round
- Multi-stage build purpose? Smaller image, faster pulls, cache efficiency.
- Why pin deps in requirements.txt? Reproducibility; prevent "works yesterday" breaks.
- Models in Docker image? No. Too big, changes independently. Pull at startup or mount.
- /health vs /ready? Liveness = process alive. Readiness = can actually serve (deps up).
- Why async for LLM endpoints? I/O-bound; async handles concurrency without blocking.
- HPA on CPU for LLM? Fails. I/O-bound, CPU stays low. Use queue depth, pending requests.
- KEDA good for? Event-driven scaling on queue, HTTP rate, custom metrics; scale-to-zero.
- Circuit breaker states? Closed → Open (failures) → Half-open (test) → Closed or Open.
- When to retry LLM? 429, 502, 503, 504. Not 400, 401, 404.
- Idempotency key? Unique per logical op; server deduplicates retries.
- Eval in CI? Golden dataset, quality metrics; block merge if below threshold.
- FP16 VRAM rule? ~2× params in GB. 7B ≈ 14GB.
- Q4 quantization VRAM? ~0.5× params. 7B ≈ 3.5–4GB.
- Secrets in Git? Never. Use vault + External Secrets.
- When to split monolith? Different scaling needs, team ownership, failure isolation.
Key Takeaways
| Topic | Takeaway |
|---|---|
| Containerization | Multi-stage build, pin deps, .dockerignore, never bake models; GPU = CUDA base + nvidia.com/gpu |
| FastAPI | Async for I/O, Pydantic validation, SSE streaming, /health + /ready, structured errors |
| Kubernetes | Requests/limits, GPU affinity, HPA on custom metrics (not CPU), KEDA for queue/scale-to-zero |
| GPU Sizing | FP16 ≈ 2× params GB; Q4 ≈ 0.5×. Match model to VRAM. |
| CI/CD for AI | Evals + quality gate; block merges that regress; canary, feature flags, auto-rollback |
| Reliability | Retries + backoff, fallback chain, circuit breaker, idempotency, timeouts, DLQ |
| IaC | Terraform + Helm; secrets from vault via External Secrets |
| Architecture | Start monolithic; split when scaling/ownership/isolation demands it |
| Monitoring | Rate, errors, latency + quality, tokens, cost. Alerts on quality and cost. |
Further Reading
- Docker: Multi-stage builds — official docs on slimming images
- FastAPI: Official docs — async patterns, StreamingResponse, health checks
- KEDA: KEDA documentation — event-driven autoscaling for Kubernetes
- CI/CD for AI: "A Practical Guide to Integrating AI Evals into Your CI/CD Pipeline" — eval gates and quality thresholds
- Reliability: "Retries, Fallbacks, and Circuit Breakers in LLM Apps" (Portkey, Maxim AI) — production patterns
- GPU Scheduling: NVIDIA GPU Operator — GPU scheduling on Kubernetes
- LLM Deployment: "The Practical Guide to Deploying LLMs" (Seldon) — end-to-end deployment patterns