Session 1: LLM Fundamentals — How Large Language Models Actually Work
You're about to learn how these things really work. Not the sanitized textbook version—the version that helps you build stuff, debug production systems, and sound like you've actually shipped AI at scale. Grab a coffee. Let's go.
The Big Picture
Before we dive into transformers and attention and all that jazz: an LLM doesn't "understand" your question. It predicts what text would typically follow your prompt in its training data. That's it. Everything else—chain-of-thought, few-shot learning, RAG—is about shaping that prediction. You're not giving instructions; you're setting up a pattern the model wants to complete. Keep that in mind and a lot of "magic" suddenly makes sense.
1. Transformer Architecture
Interview Insight: When they ask "explain attention," they don't want a paper recitation. They want to know if you understand why attention matters for your systems—can a token at position 90,000 still "see" position 5? This affects your RAG chunking strategy directly.
The Cocktail Party Analogy
Imagine you're at a crowded cocktail party. Someone asks you a question. You don't process every conversation in the room equally—you scan, gravitate toward the people whose words are relevant, and tune out the rest. Your brain is doing something like attention: weighing who matters for this moment.
That's exactly what attention does in a transformer. Each token "scans" the sequence, decides which other tokens are relevant, and blends their information accordingly. No sequential bottleneck. No waiting for the previous word to finish processing. Everything happens in parallel.
The Technical Story
The transformer architecture, from the 2017 paper "Attention Is All You Need," replaced RNNs and LSTMs as the dominant approach. The key insight: attention—the ability to weigh the importance of different parts of the input when producing each output—could be computed in parallel across the entire sequence. No more sequential bottleneck.
Encoder-decoder (original): The original transformer had two stacks. An encoder processed the input and produced a representation. A decoder generated the output one token at a time while attending to both the encoder output and its own previous tokens. Ideal for machine translation: encode the source, decode the target.
Decoder-only (GPT, Claude, Llama): Modern LLMs use decoder-only architectures. No separate encoder. One sequence, autoregressive prediction. The decoder uses causal (masked) self-attention: each position can only attend to previous positions, not future ones. The model can't cheat by looking ahead. Decoder-only models are simpler, scale better, and excel at open-ended generation.
Encoder-only (BERT): BERT and friends use only the encoder. Full sequence, bidirectional—each token can attend to all others. Great for classification, NER, embedding extraction. Not designed for generation.
Self-Attention: Q, K, V
For each token, the model computes three vectors from learned projections: Query (Q), Key (K), and Value (V). Q = XW_Q, K = XW_K, V = XW_V.
- Query: "What am I looking for?"
- Key: "What do I contain?"
- Value: The actual content we aggregate.
Attention scores = dot product of Q and K (scaled by √d_k to prevent gradient issues). Softmax gives weights that sum to 1. Output = weighted sum of Values. Each token "asks" a question, every other token "answers" with its Key, and the model combines the Values in proportion to relevance.
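The whole computation is a few matrix products. A toy NumPy sketch with random weights and made-up sizes (3 tokens, model dim 8, head dim 4), not a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 3, 8, 4            # toy sizes, chosen for illustration

X = rng.standard_normal((seq_len, d_model))   # token embeddings
W_Q = rng.standard_normal((d_model, d_k))     # learned projections
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V

scores = Q @ K.T / np.sqrt(d_k)               # scaled dot products
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax per row

output = weights @ V                          # weighted sum of Values
print(output.shape)                           # (3, 4)
```

Every row of `weights` sums to 1, so each output token is a convex combination of the Value vectors: relevance decides the mix.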
flowchart LR
subgraph Input["Input Tokens"]
X1[X1]
X2[X2]
X3[X3]
end
subgraph QKV["Q, K, V Computation"]
Q[Query]
K[Key]
V[Value]
end
subgraph Scores["Scores"]
Dot[Dot Product Q·K]
Scale[Scale by sqrt d]
Soft[Softmax]
end
subgraph Output["Output"]
Weighted[Weighted Sum of V]
end
X1 --> Q
X2 --> K
X3 --> V
Q --> Dot
K --> Dot
Dot --> Scale --> Soft
Soft --> Weighted
V --> Weighted

Multi-head attention: Instead of one Q/K/V set, the model uses multiple heads (8 to 128). Each head learns different patterns—syntax, coreference, long-range dependencies. Outputs are concatenated and projected. More capacity, more diverse relationships in parallel.
Positional encoding: Self-attention is permutation-invariant. "Cat sat on mat" and "mat sat on cat" look the same without extra info. Positional encoding injects position. Original: sinusoidal (fixed sine/cosine). Modern: often learned embeddings or RoPE/ALiBi for better extrapolation to longer sequences.
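The original sinusoidal scheme can be sketched directly from its definition (dimensions here are toy values):

```python
import numpy as np

def sinusoidal_positions(seq_len: int, d_model: int) -> np.ndarray:
    """Fixed sine/cosine positional encoding from the 2017 paper."""
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]           # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dims: sine
    pe[:, 1::2] = np.cos(angles)                   # odd dims: cosine
    return pe

pe = sinusoidal_positions(seq_len=16, d_model=8)
print(pe.shape)   # (16, 8)
print(pe[0])      # position 0: alternating sin(0)=0, cos(0)=1
```

Each position gets a unique, bounded signature, and nearby positions get similar ones; the model adds these to token embeddings so attention can tell order apart.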
The full block: Each transformer block has (1) multi-head self-attention + residual + LayerNorm, and (2) feed-forward network (two linear layers + GELU) + residual + LayerNorm. Residuals let gradients flow; LayerNorm stabilizes training.
flowchart TB
subgraph Block["Transformer Block"]
Input[Input]
Attn[Multi-Head Attention]
AddNorm1[Add and Norm]
FFN[Feed-Forward Network]
AddNorm2[Add and Norm]
Output[Output]
Input --> Attn
Attn --> AddNorm1
AddNorm1 --> FFN
FFN --> AddNorm2
AddNorm2 --> Output
end

Why Transformers Won
RNNs process sequentially—no parallelization across time. LSTMs helped with vanishing gradients but still struggled with long-range dependencies. Transformers process all positions in parallel. Attention creates direct connections between any two positions. Result: better scaling, faster training, superior performance.
Why This Matters in Production: When you design RAG, you're betting on attention. If your critical context is buried in the middle of 100K tokens, the model may underweight it. Place key info at the start or end. Chunk size and retrieval order aren't implementation details—they're architecture-level constraints.
2. Tokenization
Interview Insight: Tokenization questions often come up when discussing cost, extraction accuracy, or multilingual support. They're testing whether you know it's not "one word = one token" and that token boundaries affect real systems.
The LEGO Bricks Analogy
Think of tokenization like LEGO. Common words are pre-built pieces—"hello" might be a single brick. Rare words get assembled from smaller bricks: "tokenization" might be "token" + "ization" (that "ization" brick shows up in "organization," "realization," etc.). The vocabulary is a set of reusable pieces. The algorithm learns which combinations are frequent enough to deserve their own brick.
The Technical Story
Tokenization converts raw text into discrete tokens—integers the model consumes. Vocabulary size, sequence length, and handling of rare words all depend on the tokenizer.
BPE (Byte-Pair Encoding): Start with a base vocabulary of characters (or bytes for Unicode robustness). Iteratively merge the most frequent adjacent pair into a new token. "e" + "n" → "en". "en" + "c" → "enc". Repeat until you hit target vocab size (e.g., 50K). At inference: greedily apply merge rules. Common words = single tokens. Rare words = subwords.
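A toy version of the BPE training loop, assuming a tiny made-up corpus; real tokenizers also handle bytes, pre-tokenization, and merge-rule serialization:

```python
from collections import Counter

def most_frequent_pair(words: dict) -> tuple:
    """Count adjacent symbol pairs across the corpus; return the most frequent."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs.items(), key=lambda kv: kv[1])[0]  # first max wins ties

def merge_pair(words: dict, pair: tuple) -> dict:
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency, each word split into characters
words = {tuple("encode"): 5, tuple("encoder"): 3, tuple("end"): 2}
for step in range(3):
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print(f"merge {step + 1}: {pair}")
```

Three merges turn "e n c o d e" into "enco d e": the frequent prefix earns its own brick, exactly the LEGO dynamic from the analogy.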
SentencePiece and WordPiece: SentencePiece works on raw Unicode, can go byte-level—language-agnostic, robust to typos. WordPiece (BERT) uses a different merge criterion (maximize training data likelihood). In practice, similar subword vocabularies; differences are in implementation and language support.
Why "hello" = 1 token but "tokenization" = 3: Frequent words get their own token during BPE training. Rare/compound words decompose into reusable subwords. "Hello" is common. "Tokenization" splits into "token" and "ization."
Aha Moment: Tokenization isn't just an implementation detail. When your billing agent sees "$1,500", the model might see 4 separate tokens: ['$', '1', ',', '500']. Token boundaries affect extraction accuracy. Test with your actual data—don't assume. And if you're building for multiple languages, remember: the same sentence can tokenize to wildly different lengths. "Hello" is 1 token in English; the equivalent in Thai might be 5. That affects your cost model and context budgeting.
Token limits and cost: APIs charge per token (input vs output often different—e.g., $3 vs $15 per million for Claude). Longer prompts and responses cost more. Token count also affects latency.
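A back-of-the-envelope cost helper; the default rates mirror the Claude figures above, but plug in your provider's actual prices:

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_per_m: float = 3.0,
                     output_per_m: float = 15.0) -> float:
    """Estimate one request's cost from per-million-token rates."""
    return (input_tokens / 1e6 * input_per_m
            + output_tokens / 1e6 * output_per_m)

# A 100K-token prompt with a 2K-token response at $3/$15 per 1M
print(f"${request_cost_usd(100_000, 2_000):.2f}")  # $0.33
```

Note the asymmetry: long prompts are cheap relative to long responses, which is why verbose outputs dominate bills on chat-style workloads.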
Context window: The context window is the max tokens (input + output combined). With a 128K limit, a 100K-token prompt leaves ~28K for the response. Token count determines both how much you can provide and how much you can generate.
Typical counts: GPT-4o ~128K; Claude 3.5 Sonnet ~200K; Gemini 1.5 Pro ~1M. Advertised max ≠ effective usable context—"lost in the middle" reduces it.
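The input-plus-output arithmetic as a tiny helper; the 500-token safety margin is an arbitrary assumption:

```python
def response_budget(context_limit: int, prompt_tokens: int,
                    safety_margin: int = 500) -> int:
    """Tokens left for the response once the prompt is counted.

    Use your provider's tokenizer for real prompt counts; these are illustrative.
    """
    remaining = context_limit - prompt_tokens - safety_margin
    if remaining <= 0:
        raise ValueError("Prompt leaves no room for a response")
    return remaining

print(response_budget(128_000, 100_000))  # 27500
```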
flowchart LR
subgraph Lifecycle["Token Lifecycle"]
Raw[Raw Text]
Tokenizer[Tokenizer]
IDs[Token IDs]
Emb[Embeddings]
Layers[Transformer Layers]
Pred[Next Token Prediction]
Raw --> Tokenizer --> IDs --> Emb --> Layers --> Pred
end

Why This Matters in Production: If you're extracting structured data from user input, token boundaries can split numbers, dates, or identifiers in weird ways. Use the same tokenizer for validation. For cost estimation, always count tokens—character counts are misleading.
3. Embeddings
Interview Insight: Embedding questions usually tie to RAG, similarity search, or "how do you find relevant documents?" They want to know you understand that embeddings are vectors in a learned space, not magic.
The GPS Coordinates Analogy
Embeddings are like GPS coordinates for meaning. Similar meanings are physically close. "Dog" and "puppy" are nearby. "Dog" and "automobile" are far apart. The classic example: king - man + woman ≈ queen. The space encodes relational structure. You're not storing definitions—you're storing positions in a learned geometry.
The Technical Story
Embeddings are dense vector representations. Each token (or phrase) maps to a vector of real numbers—768, 1536, or 3072 dimensions. Learned during training. The model uses them as the first processing layer.
Semantic capture: Words in similar contexts get similar vectors. The space encodes relational structure. Higher dimensions = more nuance, but more storage and compute. Many models support dimension reduction (e.g., truncate to 256) for efficiency.
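Truncation-based dimension reduction can be sketched like this, assuming the embedding model was trained to tolerate truncation (which is what OpenAI's `dimensions` parameter relies on):

```python
import numpy as np

def truncate_and_renormalize(vec: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` components, then re-normalize to unit length.

    Cosine similarity only cares about direction, so the shortened vector
    stays usable for search at a fraction of the storage cost.
    """
    shortened = vec[:dims]
    return shortened / np.linalg.norm(shortened)

full = np.random.default_rng(1).standard_normal(1536)  # stand-in embedding
small = truncate_and_renormalize(full, 256)
print(small.shape)  # (256,)
```

Going from 1536 to 256 dims cuts index size roughly 6x; whether retrieval quality holds up is something to measure on your own data.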
Pre-trained vs fine-tuned: Pre-trained = large corpora (Wikipedia, web). Fine-tuned = domain data (legal, medical) for better domain performance.
Models: OpenAI text-embedding-3-small (1536), text-embedding-3-large (3072). Cohere embed-v3 (1024) for strong multilingual. Open-source: BGE, E5, GTE—often competitive at lower cost.
Similarity search: Embed query and documents. Compute cosine similarity or dot product. Vector DBs (Pinecone, Weaviate, pgvector) index for fast approximate nearest-neighbor. That's RAG's foundation: retrieve by similarity, inject into prompt.
Why This Matters in Production: Embedding model choice affects retrieval quality. A model trained on general web text may underperform on domain-specific jargon. For enterprise docs, consider fine-tuned or domain-specific embeddings. Embedding dimension affects index size and query latency.
4. Inference Parameters
Interview Insight: Parameter questions test whether you've actually tuned models in production. "What temperature for extraction?" is a real decision. They want specifics, not "it depends."
The Radio Dial Analogy
Temperature is like a radio dial. At 0, you get a boring news anchor—always the same, perfectly predictable. At 1, you're in normal territory. Crank it past 1 and you're in jazz improvisation: surprising, creative, sometimes incoherent. It's not "creativity"—it's how far you're flattening the probability distribution.
The Technical Story
Temperature: Scales logits before sampling. 0 = always highest-probability token (deterministic). 1 = native distribution. >1 = flattens distribution, lower-prob tokens more likely—creative but potentially incoherent. <1 = sharpens—focused, repetitive. For extraction, classification, code: 0–0.3. For creative: 0.7–1.0. Avoid >1 unless you want high variance.
Aha Moment: Temperature doesn't make the model "more creative." It flattens the probability distribution. At temp=2, a 1% token gets almost the same odds as a 10% token. It's not creativity—it's controlled chaos.
Top-p (nucleus sampling): Restrict to the smallest set of tokens whose cumulative probability exceeds p. Top-p=0.9 = only tokens covering 90% of mass, then renormalize and sample. Cuts long tails. Lower = more deterministic. Often used with temperature.
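Temperature and top-p compose naturally: scale the logits first, then keep only the nucleus. A toy sampler over raw logits (a sketch, not any provider's actual implementation):

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_p=1.0, rng=None):
    """Temperature scaling followed by nucleus (top-p) filtering."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float)
    if temperature == 0:                    # greedy: argmax, deterministic
        return int(np.argmax(logits))
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled)) # numerically stable softmax
    probs = probs / probs.sum()
    order = np.argsort(probs)[::-1]         # highest probability first
    cumulative = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cumulative, top_p) + 1]  # the nucleus
    kept = probs[keep] / probs[keep].sum()  # renormalize survivors
    return int(rng.choice(keep, p=kept))

logits = [2.0, 1.0, 0.2, -1.0]
print(sample_next_token(logits, temperature=0))  # always index 0
```

Set `temperature=0` and the sampler collapses to argmax; raise it and the tail tokens that top-p hasn't cut off start getting real probability mass.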
Top-k: Limit to k most probable tokens. Simpler than top-p but can be brittle—sometimes the "right" token is rank 51. Less common in modern APIs.
Frequency and presence penalty: Discourage repetition. Frequency = reduce prob proportional to how often token appeared. Presence = reduce prob of any token that appeared at least once. Use for long-form generation when the model loops.
Task-specific choices: Extraction: temp=0, top-p=1. Creative: temp=0.8–1.0. Classification: temp=0. Chatbots: temp=0.7, moderate frequency penalty. Code: temp=0.2–0.5.
Why same prompt ≠ same output: Unless temp=0, sampling is stochastic. Each run can differ. By design for creativity; problematic for reproducibility. Use temp=0 and fixed seed when you need consistency.
Why This Matters in Production: Wrong temperature is a common production bug. Extraction at temp=0.8 = inconsistent schemas, validation failures. Creative writing at temp=0 = dull, repetitive output. Set it explicitly for every use case. I've seen teams spend weeks debugging "random" extraction failures before someone checked the temperature—it was 0.9. Oops.
5. Context Windows
Interview Insight: Context window questions often lead to "lost in the middle" and RAG design. They want to know you won't blindly stuff 128K tokens and assume the model uses all of them.
The Desk Size Analogy
The context window is the model's short-term memory—or think of it as desk size. You can only fit so much on the desk. More importantly: stuff in the middle gets buried. The model "sees" the edges better. So even with a huge desk, what you put where matters. Critical documents at the edges; filler in the middle gets overlooked.
The Technical Story
The context window is the max tokens per request—input + output combined. GPT-3: 2K (later 4K). GPT-4: 8K, 32K. GPT-4o, Claude 3: 128K–200K. Gemini 1.5 Pro: 1M. The trend is longer contexts for document analysis, RAG, long conversations.
Lost in the middle: Models attend less effectively to the middle of long contexts. Best performance when relevant info is at the start or end. U-shaped attention bias. Stuffing 128K doesn't guarantee equal use of all tokens.
RAG implications: Don't stuff the full window. Prioritize relevant chunks. Place critical context at start or end. Use re-ranking. Chunk appropriately—not too small (loses context), not too large (dilutes relevance).
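One common placement trick, sketched as a helper; this interleaving pattern is one heuristic among several, so test it against your own retrieval benchmarks:

```python
def interleave_for_attention(chunks_by_relevance: list[str]) -> list[str]:
    """Place top-ranked chunks at the start and end of the context.

    Counters the "lost in the middle" bias: rank 1 goes first, rank 2 last,
    rank 3 second, and so on, pushing the weakest chunks to the middle.
    """
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

ranked = ["chunk1", "chunk2", "chunk3", "chunk4", "chunk5"]
print(interleave_for_attention(ranked))
# ['chunk1', 'chunk3', 'chunk5', 'chunk4', 'chunk2']
```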
Effective vs advertised: Advertised = technical limit. Effective may be lower due to lost-in-the-middle, retrieval quality, output space needs. Test with your use case.
Why This Matters in Production: RAG retrieval order matters. Put the best chunks first and last. Re-rank before injecting. Monitor for "lost in the middle" in long-document QA—if answers degrade with more context, you've hit it.
6. Model Families and Comparison
Interview Insight: Model comparison questions test cost-awareness and task-fit. They want a decision framework, not a feature list. "When would you use GPT-4o-mini over GPT-4?" is a real trade-off.
GPT-4o and GPT-4o-mini (OpenAI): GPT-4o = flagship multimodal (text, vision, audio). Strong general performance, 128K context, fast. GPT-4o-mini = smaller, cheaper—ideal for high-volume classification, extraction, simple chat. Pricing (2025): GPT-4o ~$2.50 input / $10 output per 1M; GPT-4o-mini ~$0.15 / $0.60. Use GPT-4o for complex reasoning, coding; GPT-4o-mini for bulk, cost-sensitive work.
Claude 3.5 Sonnet and Claude 3 Haiku (Anthropic): Claude 3.5 Sonnet excels at coding, document analysis, long reasoning. 200K context. Strong instruction-following. Haiku = budget, fast. Anthropic emphasizes safety. Pricing: Sonnet ~$3 / $15 per 1M. Choose Claude for document-heavy workflows, coding, safety focus.
Gemini 1.5 Pro and Flash (Google): 1M context—largest among majors. Multimodal. Strong for long-document analysis. Flash = faster, cheaper. Competitive pricing for long-context. Use when processing very long docs (books, transcripts) or when cost per token is key.
Llama 3 (Meta): Open-source, permissive. 8B, 70B, 405B. 8B runs on consumer hardware; 70B/405B need serious GPUs. Use when you need full control—self-hosting, fine-tuning, air-gapped deployment.
Mistral and Mixtral: Open-source and API. Mixtral = Mixture-of-Experts (MoE)—multiple expert feed-forward networks, a router activates a subset per token. Less compute per token, large total capacity: Mixtral 8x7B has ~47B total parameters but only ~13B active per token. Use for open-source flexibility, efficient MoE inference.
Decision summary: Best quality: GPT-4o or Claude 3.5 Sonnet. Cost efficiency: GPT-4o-mini, Haiku, Gemini Flash. Long context: Gemini 1.5 Pro. Coding: Claude 3.5 Sonnet or GPT-4o. Self-host/fine-tune: Llama 3 or Mistral. Multimodal: GPT-4o or Gemini. Don't default to the biggest model—match the model to the task. A 70B parameter model for simple classification is like using a semi-truck to move a couch.
flowchart TB
subgraph Models["Model Selection"]
Need[What do you need?]
Best[Best quality regardless of cost]
Cost[Cost efficiency]
Long[Long context 100K+]
Code[Coding]
Self[Self-host or fine-tune]
Need --> Best
Need --> Cost
Need --> Long
Need --> Code
Need --> Self
Best --> GPT4[GPT-4o or Claude 3.5]
Cost --> Mini[GPT-4o-mini, Haiku, Gemini Flash]
Long --> Gemini[Gemini 1.5 Pro]
Code --> Claude[Claude 3.5 or GPT-4o]
Self --> Llama[Llama 3 or Mistral]
end

7. Hallucinations
Interview Insight: Hallucination questions are about mitigation, not theory. They want concrete strategies: RAG, constrained output, temperature, validation pipelines. "We use RAG and JSON schema" beats "we try to reduce them."
The Confident Student Analogy
Imagine a student who never says "I don't know." They're confident, articulate, and sometimes completely wrong. They've learned the form of good answers—citations, dates, structure—without the grounding. That's an LLM. It generates what's statistically likely, not what's verifiably true. When uncertain, it still produces plausible-sounding output. Fabricated references? The model has seen thousands of real ones and learned the format.
The Technical Story
Hallucinations are plausible-sounding but factually wrong, fabricated, or inconsistent outputs.
Why they happen: LLMs predict next tokens. They learn statistical patterns, not a fact database. They generate what's likely, not what's true. When the distribution is uncertain or the model overgeneralizes, confident-but-wrong answers emerge. Fake citations occur because the model learned the format without grounding.
Types: Factual errors (wrong dates, names, numbers). Fabricated references (invented papers, URLs, quotes). Confident wrong answers. Contradictions.
Mitigation: RAG—retrieve and inject documents as source of truth. Constrained generation—JSON schema to reduce free-form fabrication. Temperature reduction—stick to high-prob tokens. Chain-of-thought—step-by-step reasoning can improve accuracy. Fact-checking pipelines—verify against external sources or second model. Fine-tuning for grounding—train to cite sources and say "I don't know" when uncertain.
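A minimal validation gate in the spirit of the constrained-generation point: reject any model output that isn't valid JSON with the expected fields. Field names here are hypothetical; production code would typically use Pydantic or jsonschema instead:

```python
import json

# Hypothetical schema for an email-extraction task
REQUIRED_FIELDS = {"booking_ref": str, "container_count": int}

def validate_extraction(raw_output: str) -> dict:
    """Parse model output and reject anything off-schema."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Model returned invalid JSON: {exc}") from exc
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"Missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"Wrong type for {field}")
    return data

good = '{"booking_ref": "MAEU123", "container_count": 2}'
print(validate_extraction(good))
```

The point is fail-closed behavior: a hallucinated or malformed field raises an error you can route to retry logic or human review, instead of flowing silently downstream.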
Why This Matters in Production: Hallucinations are the #1 reason RAG exists. If your use case requires factual accuracy, you need grounding. Extraction tasks: use structured output + validation. Never trust ungrounded model output for critical decisions.
8. Fine-tuning vs Prompt Engineering vs RAG
Interview Insight: This is a classic architecture question. They want a decision framework, not a list. "Start with prompt engineering, add RAG for knowledge, fine-tune only when necessary" shows you've built real systems.
Fine-tuning: Updates model weights on domain/task data. Teaches new knowledge, changes style, improves narrow tasks. When: Consistent custom behavior (tone, format, domain terms). Sufficient labeled data (hundreds to thousands). Can afford training cost. Limitations: Expensive. Knowledge cutoff. Updates require retraining.
Prompt engineering: Structures input without changing weights. System prompts, few-shot, chain-of-thought, output format. When: Fastest, cheapest. Knowledge fits in context. Task is stable. Prototyping. Limitations: Context window limit. Can't add new knowledge. Sensitive to wording.
RAG: Retrieves relevant docs, injects at query time. Model generates grounded in context. When: Large, proprietary, or frequently updated knowledge. Need citations. Reduce hallucinations. Limitations: Retrieval quality limits answer quality. Latency. Model may ignore or contradict context.
Decision flow: Start with prompt engineering—free and fast. Need external knowledge? Add RAG. Need style, format, or domain behavior prompt/RAG can't achieve? Consider fine-tuning. These are complementary; production often uses all three.
flowchart TB
subgraph Decision["When to Use What"]
Start[Start Here]
PE[Prompt Engineering]
RAG[RAG]
FT[Fine-tuning]
Start --> PE
PE --> Q1{Need external or proprietary knowledge?}
Q1 -->|No| Q2{Need custom style, format, or domain behavior?}
Q1 -->|Yes| RAG
Q2 -->|No| Done[You are done]
Q2 -->|Yes| FT
RAG --> Q2
end

Why This Matters in Production: Most teams over-engineer. Try prompt engineering first. Add RAG when you have docs. Fine-tune only when you've hit limits. At Maersk, email extraction might need RAG for templates + prompt engineering for format—fine-tuning only if those aren't enough.
Code Examples
Calling OpenAI API with Different Temperature Settings
from openai import OpenAI
client = OpenAI()
prompt = "Write a one-sentence tagline for a coffee shop."
# Deterministic (temperature=0) - same output every time
response_deterministic = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
    max_tokens=50,
)
print("Temperature 0:", response_deterministic.choices[0].message.content)
# Creative (temperature=0.9) - varied outputs
response_creative = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.9,
    max_tokens=50,
)
print("Temperature 0.9:", response_creative.choices[0].message.content)
# Run multiple times with temperature=0.9 to see variation
for i in range(3):
r = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
temperature=0.9,
max_tokens=50,
)
print(f" Run {i+1}: {r.choices[0].message.content}")Tokenization Example with tiktoken
import tiktoken
# Load the tokenizer used by GPT-4
enc = tiktoken.encoding_for_model("gpt-4")
text = "The quick brown fox jumps over the lazy dog. Tokenization is fascinating!"
tokens = enc.encode(text)
print(f"Text: {text}")
print(f"Token count: {len(tokens)}")
print(f"Tokens: {tokens}")
print(f"Decoded: {[enc.decode([t]) for t in tokens]}")
# Compare token counts for different texts
texts = [
"Hello",
"tokenization",
"The transformer architecture revolutionized NLP.",
]
for t in texts:
    count = len(enc.encode(t))
    print(f"'{t}' -> {count} token(s)")

Embedding Generation and Cosine Similarity
from openai import OpenAI
import numpy as np
client = OpenAI()
def get_embedding(text: str, model: str = "text-embedding-3-small") -> list[float]:
    response = client.embeddings.create(input=[text], model=model)
    return response.data[0].embedding

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a_np = np.array(a)
    b_np = np.array(b)
    return float(np.dot(a_np, b_np) / (np.linalg.norm(a_np) * np.linalg.norm(b_np)))
# Embed a few sentences
sentences = [
"The cat sat on the mat.",
"A feline rested on the rug.",
"Python is a programming language.",
]
embeddings = [get_embedding(s) for s in sentences]
# Compare similarity
print("Cosine similarities:")
for i, s1 in enumerate(sentences):
    for j, s2 in enumerate(sentences):
        if i <= j:
            sim = cosine_similarity(embeddings[i], embeddings[j])
            print(f"  '{s1[:30]}...' vs '{s2[:30]}...' -> {sim:.4f}")

Conversational Interview Q&A
Q1: "Explain how attention works and why it matters for long documents."
Weak answer: "Attention is when each token looks at other tokens and computes weights. There's Query, Key, Value, and you do dot products and softmax. It's from the 2017 paper."
Strong answer: "Attention lets each token directly weigh every other token when building its representation—no sequential bottleneck like RNNs. For long documents, that means a token at position 90,000 can still attend to position 5. But here's the catch: in practice, models underweight the middle. 'Lost in the middle' is real. So when I design RAG at Maersk, I don't stuff 100K tokens and hope. I put the most relevant chunks at the start and end, re-rank retrieval results, and test whether the model actually uses mid-context info. Attention gives the capacity to see everything—it doesn't guarantee the model will use it equally."
Q2: "When would you choose a smaller model over GPT-4?"
Weak answer: "When cost matters or when the task is simpler. GPT-4o-mini is cheaper."
Strong answer: "I use a tiered framework. First: does the task need complex reasoning, nuance, or multi-step logic? Legal analysis, complex code gen, creative writing—I lean GPT-4 or Claude 3.5. Second: volume and cost. At Maersk, if we're classifying millions of emails, the difference between GPT-4 and GPT-4o-mini can be 20–30x. I A/B test: if the smaller model gets 95% of the accuracy at 5% of the cost, that's the call. Third: latency. Real-time chat needs sub-second—smaller models win. Fourth: failure mode. Low-stakes (email subject suggestions)? Occasional errors OK. High-stakes (legal, medical)? Don't compromise. In practice I often use GPT-4o-mini for bulk extraction, GPT-4 for complex cases or when the smaller model's confidence is low."
Q3: "What causes hallucinations and how do you mitigate them in production?"
Weak answer: "Hallucinations happen because the model makes things up. We use RAG and lower temperature to reduce them."
Strong answer: "Hallucinations happen because LLMs predict next tokens from probability distributions—they generate what's likely, not what's true. They've learned the form of correct answers—citations, dates—without grounding. In production I use a stack: (1) RAG with high-quality retrieval—the model answers only from provided context and cites sources. (2) Constrained generation—JSON schema for extraction so we validate structure and reject invalid output. (3) Temperature 0 for factual tasks—no sampling exploration. (4) Chain-of-thought when reasoning matters—we can sometimes catch errors in the reasoning chain. (5) Fact-checking for critical claims—second model or rule-based validation. (6) Train or prompt the model to say 'I don't know' when it lacks info, instead of guessing. For email extraction at Maersk, schema validation + RAG with email templates cut fabricated fields dramatically."
Q4: "Explain fine-tuning vs prompt engineering vs RAG. When do you pick each?"
Weak answer: "Prompt engineering is fast, RAG adds knowledge, fine-tuning changes the model. You pick based on the use case."
Strong answer: "I start with prompt engineering—it's free and instant. System prompts, few-shot, output format. If that works and knowledge fits in context, I'm done. If I need external or proprietary knowledge—company docs, product catalogs, support articles—I add RAG. Retrieve at query time, inject into prompt. That's most enterprise use cases. Fine-tuning is the last resort: when I need persistent custom behavior—specific tone, format, domain terminology—that prompt and RAG can't achieve. Fine-tuning needs labeled data and compute; updates require retraining. My flow: prompt first, RAG for knowledge, fine-tune only when necessary. Production systems often combine all three—fine-tuned base, RAG for knowledge, prompt engineering to tie it together."
Q5: "A model has a 128K context window. Should you stuff 128K tokens into every request?"
Weak answer: "No, because of cost and lost in the middle."
Strong answer: "No. 128K is a technical limit, not a recommendation. Three reasons: (1) Lost in the middle—models underweight the middle of long contexts. Stuff 128K and your critical info might be ignored if it's buried. For RAG, place the best chunks at start and end. (2) Cost and latency—input tokens are billed. More tokens = higher cost, slower processing. If 10K is enough, adding 118K hurts. (3) Signal vs noise—marginal docs dilute relevance. Five great chunks beat 50 mixed-quality ones. Plus the window is shared: 120K input leaves only 8K for output. I retrieve only what's relevant, re-rank, place critical context strategically, and test to find the optimal size—usually much smaller than the max."
From Your Experience
Prepare your own stories using the STAR format (Situation, Task, Action, Result). Use these prompts to reflect on your work at Maersk building an AI platform and email booking automation:
Which models did you use at Maersk? Why Azure OpenAI specifically? Consider: Did you use GPT-4, GPT-3.5, or other models? What drove the choice of Azure OpenAI over direct OpenAI or other providers? Think about enterprise requirements: data residency, compliance, SLAs, existing Azure contracts, and security.
How did you handle model selection on the AI platform? What was the decision framework? Consider: Did you support multiple models? How did you decide which model to use for which use case? Think about cost, latency, accuracy, and user requirements. Did you implement model routing, fallbacks, or A/B testing?
Did you encounter hallucination issues in the email extraction system? How did you solve them? Consider: Email extraction involves structured data (dates, booking references, addresses). Did the model ever invent fields, misparse dates, or fabricate values? What mitigations did you use—constrained output schemas, validation, RAG with email templates, temperature settings, or human-in-the-loop review?
Quick Fire Round
Flashcard-style. Drill these until they're automatic.
Q: What does temperature=0 do?
A: Always picks the highest-probability token. Fully deterministic.
Q: BPE in one sentence?
A: Iteratively merge the most frequent character pairs until target vocab size.
Q: What are Q, K, V in attention?
A: Query (what I'm looking for), Key (what I contain), Value (content to aggregate). Learned projections from input.
Q: Why "lost in the middle"?
A: Models attend less effectively to the middle of long contexts. Best performance when relevant info is at start or end.
Q: When to use RAG over fine-tuning?
A: When knowledge is large, proprietary, or frequently updated. RAG adds knowledge at query time without retraining.
Q: What does top-p do?
A: Nucleus sampling—restrict to smallest set of tokens whose cumulative probability exceeds p, then renormalize and sample.
Q: Why can tokenization affect extraction accuracy?
A: Token boundaries can split numbers, dates, IDs. "$1,500" might be 4 tokens. Same tokenizer for validation.
Q: Encoder-only vs decoder-only?
A: Encoder-only (BERT): bidirectional, good for classification/embedding. Decoder-only (GPT): causal, autoregressive, good for generation.
Q: What's the main cause of hallucinations?
A: LLMs predict next tokens from probability distributions, not fact databases. They generate what's likely, not what's true.
Q: When would you use Gemini 1.5 Pro over GPT-4o?
A: When you need 1M token context—very long documents, books, transcripts. Or when cost per token is primary.
Q: What does a residual connection do?
A: Adds input to output (e.g., output = x + Attention(x)). Lets gradients flow directly, enables training deep networks.
Q: Temperature doesn't make the model "creative"—what does it actually do?
A: Flattens the probability distribution. At temp>1, lower-prob tokens get more weight. It's controlled chaos, not creativity.
Q: What's the first thing you try when adding external knowledge to an LLM?
A: RAG. Prompt engineering first for structure; RAG when you need docs. Fine-tuning is the last resort.
Key Takeaways (Cheat Sheet)
| Topic | Key Point |
|---|---|
| Transformer | Decoder-only (GPT), encoder-only (BERT), encoder-decoder (original). Self-attention: Q, K, V → scores → softmax → weighted sum of V. Multi-head = multiple attention patterns. Positional encoding fixes permutation invariance. |
| Tokenization | BPE: merge frequent pairs iteratively. "hello" = 1 token, "tokenization" = 3. Token count affects cost and context. |
| Embeddings | Dense vectors (768–3072 dims). Semantic similarity via cosine/dot product. Enable RAG and similarity search. |
| Temperature | 0 = deterministic, 1 = balanced, >1 = chaotic. Low for extraction, high for creative. |
| Top-p | Nucleus sampling: smallest set of tokens with cumulative prob > p. Filters long tail. |
| Context window | Max input + output tokens. Lost in the middle: prioritize start/end. Don't stuff full window. |
| GPT-4o | Multimodal, 128K, strong all-rounder. GPT-4o-mini = cheap, fast. |
| Claude 3.5 | 200K context, strong on docs and coding. Anthropic safety focus. |
| Gemini 1.5 Pro | 1M context, good for very long docs. Competitive pricing. |
| Llama/Mistral | Open-source, self-host, fine-tune. MoE (Mixtral) = efficient. |
| Hallucinations | Caused by next-token prediction, not fact DB. Mitigate: RAG, low temp, constrained output, fact-checking. |
| Fine-tune vs RAG vs Prompt | Prompt = fast, free. RAG = add knowledge. Fine-tune = custom behavior. Use all three together. |
Further Reading (Optional)
- Attention Is All You Need (Vaswani et al., 2017) — Original transformer paper
- Lost in the Middle (Liu et al., 2023) — Context position bias in long documents
- The Illustrated Transformer — Visual guide to attention
- OpenAI API Documentation — Pricing, parameters, models
- Anthropic Claude Documentation — Claude models and best practices