AI Engineer Prep

Session 14: Performance Optimization — Latency, Cost & Caching

Your agent takes 12 seconds to respond. The user has already closed the tab. Performance optimization isn't a nice-to-have—it's the difference between a demo and a product that people actually use. Nobody cares about your clever ReAct loop if they're staring at a spinner long enough to make a coffee.

We learned this the hard way at Maersk. When we built AI email booking automation and the Enterprise AI Agent Platform, we discovered that a beautiful, well-architected agent that nails tool calls 99% of the time is worthless if shipping coordinators tap out after 8 seconds and pick up the phone instead. Same story for cost: you can design the most elegant RAG pipeline on the planet, but if it burns $50K a month in token fees, finance will shut it down. This session is about the levers you actually pull—making AI systems fast, cheap, and good enough. In that order. Because you rarely get all three.

Here's the uncomfortable truth: most engineers treat performance as an afterthought. They ship the feature, then someone notices the bill or the latency, and suddenly it's an emergency. The Senior AI Engineer mindset is different. You bake latency and cost into your design from day one. You know your thresholds. You have a plan. Let's build that plan.


1. The Performance Triangle

Every AI system sits on a three-legged stool: latency, cost, and quality. You can optimize any two. The third pays the price. Stakeholders will ask for all three. Your job is to make the trade-offs explicit and defend them.

Interview Insight: Interviewers want to know you understand that "just make it faster and cheaper" isn't a strategy—it's a negotiation. You pick which dimension to sacrifice and why.

Think of it like ordering at a restaurant. Fast and cheap? You get the microwave special. Fast and good? Prepare to pay. Cheap and good? It'll take an hour. Same thing here. A cheaper model is faster and cheaper but flubs complex reasoning. A better model is higher quality but burns tokens and takes longer. Semantic caching improves latency and cost (instant hits, near-zero cost) but can degrade quality if your similarity threshold is too loose and you return the wrong cached answer.

Why This Matters in Production: Strategic optimization can cut LLM costs 60–80% while maintaining quality. Research shows inference costs can drop up to 98% with routing, caching, and distillation—but only if you know which levers to pull for your use case.

Aha Moment: The triangle isn't a constraint—it's a design tool. Decide up front: "For this feature we optimize for latency and cost; quality can be 90%." That clarity prevents scope creep and endless "can we make it better?" loops.


2. Model Routing

Model routing is sending the right query to the right model. Instead of blasting every request at GPT-4o, you keep a pool—mini for simple stuff, full fat for hard stuff—and route each prompt to the smallest model that can handle it.

Interview Insight: Routing is the single biggest cost lever. GPT-4o-mini costs ~$0.15/M input tokens; GPT-4o costs ~$2.50/M. That's a 10–20x gap. A well-tuned router can cut costs 50–85% with minimal quality loss.

It's like triage in an ER. Not everyone needs the senior consultant. A cut finger gets a nurse; chest pain gets the cardiologist. Your classifier does the triage: simple (FAQ, greetings, yes/no) → mini; complex (multi-hop reasoning, policy interpretation, nuance) → full model. Confidence-based routing tries the small model first; if it's unsure, escalate. Recent routing research reports ~95% of the large model's quality while sending only ~26% of requests to it.

Why This Matters in Production: For our Enterprise AI Agent Platform at Maersk, we routed "what's the status of container X?" to mini and "explain tariff discrepancies across three shipments" to the full model. The routing overhead is ~11µs with modern gateways—negligible. The savings are not.

Aha Moment: The hard part isn't the routing logic—it's defining "simple" vs "complex" in a way that generalizes. Domain matters. A "simple" question in logistics might be complex in legal. Label data, tune, repeat.


3. Semantic Caching

Traditional caching needs an exact key match. "What is the refund policy?" and "What's the refund policy?" = two cache misses. Semantic caching treats them as the same intent. Embed both, compare, return the cached answer if they're similar enough.

Interview Insight: Cache hit rates of 40–70% are realistic for repetitive workloads. That's 40–70% of requests that never touch the LLM. API call reduction up to 68.8%, response time 100x faster on hits.

Think of it like a librarian who remembers the spirit of your question. You ask "how do I return something?" and she recalls answering "what's your return policy?" yesterday. Same answer. Semantic caching does that at scale: embed query → search cache → if similarity ≥ threshold (0.92–0.95), return cached; else call LLM and cache the new pair. GPTCache is the go-to open-source library—plugs into Redis, Chroma, Milvus, LangChain.

Why This Matters in Production: Customer support, FAQ bots, internal knowledge bases—repetitive, paraphrased queries everywhere. Semantic caching shines. The killer: cache invalidation. Policy update? Stale answers. Use TTL, event-based invalidation, or namespace partitioning.

Aha Moment: Threshold is everything. 0.99 = barely any hits. 0.80 = wrong answers for similar-but-different queries ("return policy" vs "return address"). 0.93 is a good starting point; tune with real data.


4. Token Budget Management

Tokens are the currency of LLM APIs. More tokens = more cost + more latency. Output tokens typically cost 3–5x more than input tokens, but both count.

Interview Insight: You can cut token usage 60–80% without hurting quality. Verbose system prompts, over-retrieval, unnecessary chain-of-thought—death by a thousand cuts.

It's like packing for a trip. You can cram everything or pack smart. Shorter prompts, fewer few-shot examples, truncate context, compress retrieved chunks. Context caching (Anthropic, OpenAI) lets you cache long static sections (system prompts, style guides) at 50–90% lower rates—huge for repeated content. For output: set max_tokens, ask for concise answers, use structured JSON. Monitor token usage per request, per model, per feature. Outliers matter: one request at 50K tokens when the average is 2K will skew your bills.

Why This Matters in Production: At Maersk we tracked tokens per agent, per customer. Daily budgets, alerts at 50% and 80%. Hidden costs (API overhead, data transfer) add 15–30%—factor them in.

Aha Moment: Don't ask for "step-by-step reasoning" when a direct answer suffices. Don't retrieve 10 chunks when 3 would do. Every token is a choice.
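The "retrieve 3, not 10" discipline can be enforced mechanically. A minimal sketch that packs relevance-sorted chunks into a fixed token budget, using a rough 4-characters-per-token estimate (an assumption — swap in a real tokenizer like tiktoken for production):

```python
def rough_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    return max(1, len(text) // 4)

def pack_chunks(chunks: list[str], budget_tokens: int) -> list[str]:
    """Greedily keep chunks (assumed pre-sorted by relevance) until the budget is spent."""
    packed, used = [], 0
    for chunk in chunks:
        cost = rough_tokens(chunk)
        if used + cost > budget_tokens:
            break
        packed.append(chunk)
        used += cost
    return packed
```

Call it between retrieval and prompt assembly — e.g. `pack_chunks(retrieved, budget_tokens=1500)` — so the context window stops silently inflating with marginal chunks.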


5. Streaming Responses

Streaming returns tokens as they're generated. Total generation time is the same, but perceived latency plummets. Users see text in hundreds of milliseconds instead of staring at a blank screen for seconds.

Interview Insight: TTFT (time to first token) is the UX metric that matters for chat. Streaming is non-negotiable for interactive interfaces.

It's the difference between a loading spinner and a typing indicator. Same wait—different psychology. Use SSE (Server-Sent Events) or WebSockets. OpenAI, Anthropic, everyone supports stream=True. FastAPI: StreamingResponse with an async generator. Track both TTFT (responsiveness) and TTLT (total cost, backend load). Caveat: you can't validate the full response before streaming—guardrails get trickier. Stream to buffer, validate, then stream to user, or accept brief exposure.

Why This Matters in Production: For email booking automation, first-token latency was the difference between "this feels broken" and "this is working." Users tolerate a stream; they don't tolerate 8 seconds of nothing.

Aha Moment: Streaming doesn't speed up the model. It changes perception. Sometimes that's enough.
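The streaming path can be sketched as an async generator that re-emits model tokens in SSE framing as they arrive. The token source below is a stand-in (any async iterable of strings — in production, the delta contents of a stream=True completion); in FastAPI you'd wrap such a generator in StreamingResponse with media_type="text/event-stream":

```python
import asyncio
from typing import AsyncIterator

async def sse_stream(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Re-emit tokens in Server-Sent Events framing as they arrive."""
    async for tok in tokens:
        yield f"data: {tok}\n\n"
    yield "data: [DONE]\n\n"  # conventional end-of-stream sentinel

async def fake_tokens() -> AsyncIterator[str]:
    # Stand-in token source; a real one would iterate the provider's
    # streamed chunks and yield each chunk's delta content.
    for tok in ["Hel", "lo", "!"]:
        await asyncio.sleep(0)  # simulate arrival over time
        yield tok
```

The key property: the first `data:` line leaves the server as soon as the first token exists, which is exactly what drives TTFT down.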


6. Batching Strategies

Batching = group requests, process together, amortize overhead. Not for real-time chat. For document processing, evaluations, extraction, nightly scoring.

Interview Insight: OpenAI's Batch API gives 50% discount. Submit JSONL, get results in 24 hours. Ideal for thousands of tickets, document classification, embedding jobs.

Like carpooling. One trip, multiple passengers. Custom batching: queue requests, process in groups of 10–50, return via webhooks or polling. vLLM's continuous batching dynamically batches requests with different lengths—better GPU utilization for self-hosted. Batching is wrong for interactive agent loops where each step depends on the previous.

Why This Matters in Production: Evaluation runs, content moderation at scale, bulk document extraction—batch is your friend. Save the real-time path for the user-facing agent.

Aha Moment: Batch = cheap but slow. Real-time = expensive but fast. Know which workload you're optimizing for.
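The Batch API workflow starts with a JSONL file, one request per line. A minimal builder — the line shape follows OpenAI's documented batch request format; the custom_id scheme and max_tokens are our choices:

```python
import json

def build_batch_lines(queries: list[str], model: str = "gpt-4o-mini") -> list[str]:
    """One JSONL line per request, in the Batch API's request shape."""
    lines = []
    for i, query in enumerate(queries):
        lines.append(json.dumps({
            "custom_id": f"req-{i}",  # our naming scheme; used to match results back
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": query}],
                "max_tokens": 200,
            },
        }))
    return lines
```

Write `"\n".join(lines)` to a file, upload it with purpose="batch", create the batch job, and collect results within 24 hours at half the synchronous price.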


7. Quantization

Quantization shrinks model weights: FP32 → FP16 → INT8 → INT4. Smaller = less memory, faster inference, some quality loss.

Interview Insight: A 7B model at FP16 needs ~14GB; at 4-bit, ~4GB. Fits on consumer hardware. Q4 for most apps; Q8 when quality matters; FP16 when it really matters.

GGUF (llama.cpp, Ollama): dominant format, Q4_K_M / Q5_K_M / Q8_0. GPTQ: Hessian-based, great throughput, aggressive compression. AWQ: protects salient weights, ~95% creative fidelity at 4-bit—best for quality-sensitive GPU deployments. AWQ leads on quality, GPTQ on throughput, GGUF on ease of deployment (2025–2026).

Why This Matters in Production: Self-hosted, edge, GPU-constrained? Quantization is how you fit the model. For API-only, you care less—but if you ever go local, this is the playbook.

Aha Moment: Q4 is fine for classification, extraction, simple Q&A. Complex reasoning? Bump to Q5 or Q8. Don't over-quantize.
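The memory numbers above fall straight out of parameter count × bits per weight. A back-of-envelope helper — the overhead multiplier for KV cache, activations, and buffers is a rough assumption:

```python
def estimate_vram_gb(params_billions: float, bits_per_weight: float, overhead: float = 1.0) -> float:
    """Weights = params x bits/8 bytes; bump overhead (e.g. 1.2) for KV cache and buffers."""
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return round(weight_bytes * overhead / 1e9, 1)
```

`estimate_vram_gb(7, 16)` reproduces the ~14 GB FP16 figure; `estimate_vram_gb(7, 4)` gives 3.5 GB — real Q4 formats store slightly more than 4 bits per weight, which is why the section quotes ~4 GB.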


8. Model Distillation

Distillation = train a small "student" on the outputs of a large "teacher" model. Run teacher on your data, collect pairs, fine-tune student. Result: 10–100x smaller, 2–5x faster, almost as good on your narrow domain.

Interview Insight: Perfect for high-volume, narrow tasks. Customer support for one product, intent classification, entity extraction. Fine-tuning GPT-4o-mini on GPT-4o outputs = distillation.

Amazon Bedrock's Model Distillation: up to 500% faster, 75% cheaper, <2% accuracy loss. Caveat: homogenization—diversity drops, out-of-distribution handling can suffer.

Why This Matters in Production: When a general-purpose giant is overkill and you have volume, distill. One model for one job.

Aha Moment: Distillation locks you into a task. If the task drifts, the student drifts. Retrain when domains evolve.
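Collecting teacher outputs into training data is the mechanical half of distillation. A sketch — the messages-JSONL shape matches OpenAI's chat fine-tuning format; `teacher_answer` stands in for a real teacher-model call, and the system prompt is illustrative:

```python
import json

def to_finetune_line(prompt: str, teacher_answer: str,
                     system: str = "You are a support assistant.") -> str:
    """One JSONL line in the chat fine-tuning format: student learns to imitate the teacher."""
    return json.dumps({
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": teacher_answer},  # teacher's output = target
        ]
    })
```

Run the teacher over a few thousand representative prompts, write one line per pair, fine-tune the student on the file, then evaluate the student against the teacher on a held-out set before cutting over.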


9. Async Execution and Parallelism

Sequential LLM calls are slow. Five tool calls, each waiting for the previous = latency adds up. Independent calls? Run them in parallel with asyncio.gather.

Interview Insight: Parallelism is the biggest latency win for agents. Sum of sequential vs max of parallel—often 3–5x faster.

Fetch weather and calendar in parallel. Multiple retrieval queries at once. Parallel evaluation of 100 outputs. Use a semaphore to cap concurrent requests (respect rate limits). Retries with exponential backoff for 429s.

Why This Matters in Production: Our email booking agent had calendar, email, and availability tool calls. We parallelized the independent ones. p95 latency: 8s → 3s.

Aha Moment: Map the dependency graph. What must wait for what? Maximize parallel waves.
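The semaphore-plus-backoff pattern from above, sketched minimally. `RateLimitError` is a stand-in for the provider's 429 exception; the concurrency cap and delay numbers are illustrative:

```python
import asyncio
import random

class RateLimitError(Exception):
    """Stand-in for the provider's 429 error."""

SEM = asyncio.Semaphore(10)  # cap concurrent in-flight calls; tune to your RPM limit

async def call_with_backoff(fn, *args, max_retries: int = 5, base_delay: float = 1.0):
    """Run fn under the semaphore; exponential backoff with jitter on rate limits."""
    for attempt in range(max_retries):
        async with SEM:
            try:
                return await fn(*args)
            except RateLimitError:
                if attempt == max_retries - 1:
                    raise
        # Back off outside the semaphore so waiting doesn't hold a slot.
        await asyncio.sleep(base_delay * 2 ** attempt + random.random() * base_delay)
```

Wrap each tool call in `call_with_backoff` and fan them out with asyncio.gather — you get the parallel speedup without bursting past the provider's limits.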


10. Connection Pooling and HTTP Optimization

HTTP connection setup has overhead. Reuse connections. httpx with pooling, requests.Session, HTTP/2 multiplexing. Can shave 50–200ms per request.

Interview Insight: Boring but real. High-volume = connection churn. Pool size should match concurrency.

Why This Matters in Production: When you're squeezing every millisecond, connection reuse matters.

Aha Moment: It's the last 5% of latency—but sometimes that 5% is the difference.


11. Edge and Local Deployment

Run locally or at the edge: no API cost, no network hop. Ollama = simplest, API-compatible, dev-friendly. vLLM = production-grade, continuous batching, PagedAttention, prefix caching, distributed inference.

Interview Insight: Local when: privacy (data can't leave), latency (sub-100ms), volume (API costs exceed GPU costs), offline/air-gapped.

Trade-offs: you own infra, GPU memory limits model size, no auto-scaling. API = flexible, low ops, pay per token. Rule of thumb: <10M tokens/month + no data residency? API + optimization wins. 100M+ or strict residency? Self-host.

Why This Matters in Production: Maersk has data residency requirements. Some workloads never touch the cloud.

Aha Moment: "Local vs API" isn't a one-time choice. It's per-workload, per-region, per-compliance-boundary.
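The 10M/100M rule of thumb falls out of simple arithmetic. A sketch with illustrative numbers — the blended API price and the GPU-plus-ops budget are assumptions; plug in your own:

```python
def monthly_api_cost(tokens_millions: float, price_per_million: float) -> float:
    """Blended API spend for a month's token volume."""
    return tokens_millions * price_per_million

def break_even_tokens_millions(gpu_monthly: float, price_per_million: float) -> float:
    """Token volume at which API spend equals the GPU + ops budget."""
    return gpu_monthly / price_per_million

# Illustrative: a $1,500/month GPU box vs a $10/M blended large-model rate
# breaks even around 150M tokens/month -- consistent with the 100M+ rule above.
```

Below the break-even point, API plus routing/caching wins on pure cost; above it (or under residency constraints), self-hosting starts to pay for itself.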


Mermaid Diagrams

Visual learners, rejoice. These diagrams map the concepts above. No colors, no fluff—just the flow.

Model Routing Flow

flowchart TD
    userQuery[User Query]
    router[Router / Classifier]
    simpleCheck{Simple?}
    mini[GPT-4o-mini]
    full[GPT-4o]
    response[Response]
 
    userQuery --> router
    router --> simpleCheck
    simpleCheck -->|Yes| mini
    simpleCheck -->|No| full
    mini --> response
    full --> response

Semantic Cache Flow

flowchart TD
    query[Query]
    embed[Embed Query]
    search[Vector Search Cache]
    hitCheck{Similarity >= 0.93?}
    cacheHit[Return Cached]
    callLLM[Call LLM]
    storeCache[Store in Cache]
    returnResp[Return Response]
 
    query --> embed
    embed --> search
    search --> hitCheck
    hitCheck -->|Yes| cacheHit
    hitCheck -->|No| callLLM
    callLLM --> storeCache
    storeCache --> returnResp

Performance Optimization Stack

flowchart TB
    subgraph request[Request Layer]
        req[Incoming Request]
    end
 
    subgraph routing[Routing Layer]
        route[Model Router]
        mini[Small Model]
        large[Large Model]
    end
 
    subgraph cache[Cache Layer]
        check[Semantic Cache]
        hit[Hit]
        miss[Miss]
    end
 
    subgraph exec[Execution Layer]
        stream[Streaming]
        tokens[Token Budget]
    end
 
    req --> check
    check -->|hit| hit
    check -->|miss| route
    route --> mini
    route --> large
    mini --> stream
    large --> stream
    stream --> tokens

Agent Latency: Sequential vs Parallel

flowchart LR
    subgraph sequential[Sequential]
        s1[Tool 1]
        s2[Tool 2]
        s3[Tool 3]
        s1 --> s2 --> s3
    end
 
    subgraph parallel[Parallel]
        fork((Start))
        p1[Tool 1]
        p2[Tool 2]
        p3[Tool 3]
        join((Done))
        fork --> p1
        fork --> p2
        fork --> p3
        p1 --> join
        p2 --> join
        p3 --> join
    end

Quantization Trade-offs

flowchart LR
    fp32[FP32] --> fp16[FP16]
    fp16 --> q8[Q8]
    q8 --> q4[Q4]
 
    fp32 -.->|quality| high[High]
    q4 -.->|quality| lower[Lower]
    q4 -.->|speed| fast[Faster]

Code Examples

Copy-paste and adapt. These are production patterns we used at Maersk—tweak the models and thresholds for your setup.

Model Routing with Complexity Classifier

from openai import OpenAI
 
client = OpenAI()
 
def classify_complexity(query: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Classify as 'simple' or 'complex'. Simple: FAQ, greetings, yes/no. Complex: reasoning, multi-step, nuanced."},
            {"role": "user", "content": query}
        ],
        max_tokens=10,
        temperature=0
    )
    return response.choices[0].message.content.strip().lower()
 
def route_and_complete(query: str) -> str:
    complexity = classify_complexity(query)
    model = "gpt-4o-mini" if "simple" in complexity else "gpt-4o"
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": query}],
        temperature=0.7
    )
    return response.choices[0].message.content
 
# Usage
answer = route_and_complete("What is your refund policy?")  # -> mini
answer = route_and_complete("Compare our refund policy with Amazon's and suggest improvements")  # -> gpt-4o

Semantic Caching with Vector Similarity

from openai import OpenAI
import numpy as np
from typing import Optional
 
client = OpenAI()
CACHE: list[tuple[np.ndarray, str]] = []
SIMILARITY_THRESHOLD = 0.93
 
def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)
 
def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
 
def find_cached(q_emb: np.ndarray) -> Optional[str]:
    for c_emb, c_resp in CACHE:
        if cosine_similarity(q_emb, c_emb) >= SIMILARITY_THRESHOLD:
            return c_resp
    return None
 
def complete_with_cache(query: str) -> str:
    q_emb = embed(query)  # embed once; reuse for both lookup and store
    cached = find_cached(q_emb)
    if cached:
        return cached
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": query}],
        temperature=0
    )
    answer = resp.choices[0].message.content
    CACHE.append((q_emb, answer))  # store the embedding we already computed
    return answer

Async Parallel Tool Calls

import asyncio
from openai import AsyncOpenAI
 
client = AsyncOpenAI()
 
async def fetch_weather(location: str) -> str:
    resp = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Weather in {location}?"}],
        max_tokens=100
    )
    return resp.choices[0].message.content
 
async def fetch_calendar(user_id: str) -> str:
    resp = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Calendar for {user_id}"}],
        max_tokens=200
    )
    return resp.choices[0].message.content
 
async def agent_turn(location: str, user_id: str) -> str:
    weather, calendar = await asyncio.gather(
        fetch_weather(location),
        fetch_calendar(user_id)
    )
    resp = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Weather: {weather}\nCalendar: {calendar}\nSuggest activities."}]
    )
    return resp.choices[0].message.content

Token Usage Tracking

from openai import OpenAI
from dataclasses import dataclass
 
PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}
 
@dataclass
class TokenUsage:
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
 
def calculate_cost(model: str, in_tok: int, out_tok: int) -> float:
    p = PRICING[model]  # fail loudly on unknown models rather than silently mispricing them
    return (in_tok * p["input"] + out_tok * p["output"]) / 1_000_000
 
client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}]
)
usage = resp.usage
cost = calculate_cost("gpt-4o-mini", usage.prompt_tokens, usage.completion_tokens)
# Log TokenUsage(model, usage.prompt_tokens, usage.completion_tokens, cost) to observability

Conversational Interview Q&A

Interviewers love "how would you..." questions. They're fishing for nuance—trade-offs, numbers, war stories. Weak answers are generic. Strong answers name the lever, the metric, and the outcome. Here's how to sound like someone who's shipped this.

1. "How do you reduce p95 latency for an agent that makes 5 sequential tool calls?"

Weak answer: "We'd use a faster model and optimize the prompts." (Vague. Doesn't address the actual bottleneck—sequential execution.)

Strong answer: "First, we check if those five calls are independent. For our Maersk email booking agent, calendar and availability fetches didn't depend on each other—we ran them in parallel with asyncio.gather. That alone cut latency from ~8s to ~3s. If some calls depend on others, we identify the critical path and parallelize within waves. We also use a smaller model for tool selection when intent is clear, and stream the final response so TTFT is low. Semantic caching helps when similar queries have been seen—skip tool calls entirely."

2. "Explain your model routing strategy."

Weak answer: "We route simple queries to a small model and complex ones to a large model." (True but shallow. How do you define simple? How do you implement it?)

Strong answer: "On our Enterprise AI Agent Platform at Maersk, we use a GPT-4o-mini classifier to label queries as simple or complex. Simple: 'what's the status of container X?', FAQ lookups, greetings. Complex: tariff comparisons, multi-shipment reasoning, policy interpretation. We achieve ~55% cost reduction. The hard part was tuning—we had false negatives early, routing complex queries to mini. We added a confidence threshold and fallback: if the small model's response is short or low-confidence, we retry with the large model. Router overhead is ~11µs—negligible."

3. "Walk through a cost optimization you implemented."

Weak answer: "We switched to a cheaper model." (One lever. No numbers. No trade-offs.)

Strong answer: "Customer support chatbot was burning ~$8K/month on GPT-4o for every request. I implemented three things: (1) Model routing—classifier sent ~60% of traffic to GPT-4o-mini. (2) Semantic cache with 0.93 threshold—40% hit rate on that traffic. (3) Token budgets—shorter system prompt, max_tokens 500 for simple queries, top-3 chunks only. Result: $2,200/month, 72% cut. Quality stayed within 2%. We did see a bump in 'unhelpful' on misrouted complex queries—tuned the classifier and added the confidence fallback."

4. "When would you run a model locally vs API?"

Weak answer: "When you need more control or lower cost." (Missing privacy, latency, volume, compliance.)

Strong answer: "Four drivers: (1) Privacy—data can't leave premises. Maersk has workloads with strict data residency. (2) Latency—no network hop, sub-100ms. (3) Volume—API costs exceed GPU + ops. (4) Offline/air-gapped. Trade-offs: local = full control, no per-token billing, but you own infra and scaling. Rule of thumb: under ~10M tokens/month, API + routing/caching usually wins. 100M+ or strict residency, self-host with vLLM."

5. "How does semantic caching work? What are the failure modes?"

Weak answer: "You cache similar queries and return cached answers. Sometimes you get wrong answers." (Doesn't explain mechanism or specific failure modes.)

Strong answer: "Embed queries, store (embedding, response) pairs. On new query, embed and search; if similarity ≥ threshold (we use 0.93), return cached. Failure modes: (1) False positives—'return policy' vs 'return address' can be similar; wrong cached answer. Tune threshold, maybe add intent classification. (2) Stale cache—policy updated, cache not. TTL or event-based invalidation. (3) Low hit rate—unique queries, caching adds overhead with no benefit. Monitor; disable for low-hit use cases. (4) Embedding drift—change embedding model, cache incompatible. Version the cache."

6. "How do you handle rate limits with parallel LLM calls?"

Weak answer: "We use retries and backoff." (Doesn't mention semaphore, proactive throttling, or batch API.)

Strong answer: "Semaphore to cap concurrency—e.g. max 10–20 in-flight so we don't burst past the provider's RPM. Each async call acquires before API, releases after. Retries with exponential backoff on 429. For batch work, Batch API has higher limits—use it when latency isn't critical. For real-time, respect RPM/TPM and design for graceful degradation: queue or 'try again' rather than hard fail."

7. "Explain quantization. Q4 vs Q8 vs FP16?"

Weak answer: "Q4 is smaller, FP16 is higher quality." (Missing when to use each, formats.)

Strong answer: "Quantization reduces weight precision. FP16 = full; Q8 = 8-bit, minimal loss; Q4 = 4-bit, smaller/faster, some loss. Use FP16 when quality is paramount and you have memory. Q8 for minimal loss with size reduction. Q4 when fitting on limited hardware—fine for classification, extraction, simple Q&A; noticeable loss on complex reasoning. GGUF for Ollama/llama.cpp; AWQ for GPU quality; GPTQ for throughput."


From Your Experience (Maersk Prompts)

When they ask "tell me about a time you..."—you've got these. Tailor the numbers and scope to the role, but the structure holds.

"How did you implement cost tracking on the Enterprise AI Agent Platform?"
We instrumented every LLM call with token counts (input, output, cached) and model name. Sent to our observability pipeline (MLflow, LangSmith-style tooling), multiplied by per-model pricing for cost per request. Aggregated by agent, feature, customer. Daily budgets, alerts at 50%, 80%, 100%. At 80% we review top cost drivers—often outlier requests or misconfigured agents—and fix before hitting cap.

"Did you implement model routing for the booking automation?"
Yes. GPT-4o-mini classifier for simple vs complex. Simple → mini; complex → full. ~55% cost reduction with minimal quality impact. Had to tune—early false negatives. Added confidence threshold and fallback: short or low-confidence response → retry with large model.

"How did you handle latency for the email booking agent?"
Multiple tool calls—calendar, email, availability. We parallelized independent ones with asyncio.gather. Streamed the final response. Used mini for tool selection when intent was clear; only the summary used GPT-4o. p95: 8s → 3s.


Quick Fire Round

Rapid recall. If you can answer these cold, you're ready for the performance section. Run through them the night before.

  1. Performance triangle? Latency, cost, quality—optimize two, the third suffers.
  2. Biggest cost lever? Model routing. 10–20x price gap between mini and full.
  3. Semantic cache threshold? 0.92–0.95; 0.93 is a safe start.
  4. TTFT? Time to first token; key UX metric for chat.
  5. Batch API discount? 50% on OpenAI Batch API.
  6. Q4 vs Q8? Q4 = smaller, faster, some loss; Q8 = minimal loss.
  7. GGUF? Format for llama.cpp/Ollama; packages model + metadata.
  8. AWQ vs GPTQ? AWQ = quality; GPTQ = throughput.
  9. Distillation? Student learns from teacher outputs; narrow, high-volume tasks.
  10. asyncio.gather? Run independent coroutines in parallel.
  11. Cache invalidation? TTL, event-based, namespace partitioning.
  12. Context caching? Cache static prompts; 50–90% lower rates.
  13. vLLM features? Continuous batching, PagedAttention, prefix caching.
  14. When to self-host? Privacy, latency, high volume, offline.
  15. Rate limit with parallelism? Semaphore + exponential backoff on 429.

Key Takeaways

Performance triangle: optimize two of latency, cost, quality; make trade-offs explicit
Model routing: simple → mini, complex → full; 50–85% cost savings
Semantic caching: similar queries → cached response; 40–70% hit rates; tune threshold
Token budget: shorter prompts, fewer examples, context cache, max_tokens, monitor
Streaming: TTFT drives UX; non-negotiable for chat
Batching: 50% discount for non-urgent work; Batch API or custom
Quantization: Q4 for size/speed, Q8 for quality, FP16 when paramount
Distillation: small student for narrow, high-volume tasks
Parallelism: asyncio.gather for independent ops; semaphore for rate limits
Local vs API: local for privacy, latency, volume, offline; API for flexibility, low ops

Further Reading

Don't memorize these—bookmark them. Useful when you need to go deeper on a specific lever or cite a paper.

  • UniRoute (arXiv:2502.08773) — Universal model routing for dynamic LLM pools.
  • xRouter (arXiv:2510.08439) — Cost-aware LLM orchestration via reinforcement learning.
  • BEST-Route (arXiv:2506.22716) — Adaptive routing with test-time optimal compute.
  • GPTCache — Open-source semantic cache; LangChain, Redis, Chroma integrations.
  • vCache (arXiv:2502.03771) — Verified semantic prompt caching with correctness guarantees.
  • vLLM — High-throughput serving: PagedAttention, continuous batching, distributed inference.
  • OpenAI Batch API — 50% discount for bulk, non-urgent processing.
  • AWQ vs GPTQ vs GGUF — Quantization method comparison (2025–2026).
  • TAID, DistiLLM-2 — Modern distillation approaches.
  • LangSmith / MLflow / Phoenix — Observability and cost tracking for LLM apps.