Session 4: Advanced RAG Patterns
The stuff that separates "I built a RAG" from "I built a RAG that actually works."
This session is your deep dive into advanced Retrieval-Augmented Generation—the patterns that turn a demo into a production system. Hybrid search, reranking, query transformation, agentic RAG, CRAG, GraphRAG, and how to evaluate all of it. The goal isn't to memorize papers; it's to know what to reach for when naive RAG fails. Study from this file alone—no external links required during prep. But here's the thing: we're going to make it interesting.
The Big Picture: Why Naive RAG Isn't Enough
Anyone can slap a vector store in front of an LLM and call it RAG. Query goes in, chunks come out, LLM generates. Done. Except... it's not done. Your users get wrong answers. Irrelevant context pollutes the response. Exact matches get missed. The system works great on your demo queries and falls apart in production.
Advanced RAG questions separate senior from mid-level. Anyone can build naive RAG. The interview tests whether you know what to do when naive RAG FAILS.
So let's get into the patterns that actually fix things. And we'll start with analogies—because if you can explain why something works in plain English, you understand it. If you can only recite the paper, you're one step away from cargo-culting.
1. Hybrid Search: Two Ways to Find a Book
The Analogy First
Imagine you're searching a library. You have two options:
Option A: Card catalog (keyword search). You look up "diabetes treatment" and get every book whose card mentions those exact words. Precise. Literal. If the book says "insulin resistance" but never says "diabetes," you won't find it.
Option B: Walking to the "meaning neighborhood" (vector search). You ask a librarian "Where are books about blood sugar problems?" and they point you to a section. You might find books about diabetes, insulin, metabolic syndrome—all semantically related. But if you're looking for a specific error code like "ERR-4052," good luck. That string might not even be in the embedding model's vocabulary.
Hybrid search = doing BOTH. You search the card catalog AND you walk to the meaning neighborhood. Then you combine both result lists. Documents that show up in both lists rise to the top. You get the best of exact matching and semantic understanding.
Interview Insight: Hybrid search questions test whether you understand WHY vector search alone isn't enough. Hint: try searching for "error code ERR-4052" with pure vector search. It won't work.
The Technical Details
Sparse retrieval (BM25, TF-IDF) matches keywords. Term frequency, inverse document frequency—documents with more occurrences of rare, important terms rank higher. Dense retrieval (vector embeddings) captures meaning. "Automobile" and "car" are close in embedding space.
The problem with raw scores: BM25 produces unbounded scores. Vector search produces cosine similarities between -1 and 1. You can't just average them. So we use Reciprocal Rank Fusion (RRF)—we ignore raw scores entirely and use only rank positions. For each document: add 1/(k + rank) for every list it appears in. Typical k = 60. Sum across retrievers, re-sort. Documents that rank highly in both BM25 and vector search get contributions from both and rise to the top. No tuning. No normalization. It just works.
Alternative: linear combination. Some systems normalize scores (min-max or z-score) and do α × vector + (1 − α) × BM25. Typical split: 0.7 vector, 0.3 keyword. Tune α on your eval set—domains with lots of IDs and codes may need 0.5/0.5.
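A sketch of that linear combination, with min-max normalization. The score dicts are assumed to map doc IDs to raw retriever scores (cosine similarities, BM25 scores); the function names are illustrative, not a library API:

```python
from typing import Dict, List, Tuple

def minmax_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Scale raw scores to [0, 1] so BM25 and cosine scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 1.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def linear_hybrid(
    vector_scores: Dict[str, float],
    bm25_scores: Dict[str, float],
    alpha: float = 0.7,
) -> List[Tuple[str, float]]:
    """alpha * normalized vector score + (1 - alpha) * normalized BM25 score."""
    v = minmax_normalize(vector_scores)
    b = minmax_normalize(bm25_scores)
    docs = set(v) | set(b)
    fused = {d: alpha * v.get(d, 0.0) + (1 - alpha) * b.get(d, 0.0) for d in docs}
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)
```

Unlike RRF, this fusion is score-sensitive, which is exactly why the normalization step and the α tuning are mandatory.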
Why This Matters in Production
In production, users search for product IDs, error codes, case names, and technical identifiers. Pure vector search will miss these. Pure keyword search will miss paraphrases and conceptual questions. Hybrid search is table stakes for any serious RAG system. Elasticsearch, OpenSearch, Milvus, and Azure AI Search all support RRF natively now. Use it. At Maersk-scale logistics, a user searching for "container MSCU1234567" needs that exact match—BM25 delivers. A user asking "how do I track my shipment" needs semantic understanding—vectors deliver. You need both.
Aha Moment: BM25 (keyword search) was invented in the 1990s. It STILL beats vector search for exact term matching. The best retrieval systems in 2026 combine a 30-year-old algorithm with cutting-edge neural embeddings.
Architecture: Hybrid Search Flow
flowchart TB
subgraph Input
Q[User Query]
end
subgraph Retrieval
Q --> BM25[BM25 / TF-IDF]
Q --> EMB[Embedding Model]
EMB --> VEC[Vector Search]
BM25 --> R1[Ranked List 1]
VEC --> R2[Ranked List 2]
R1 --> RRF[Reciprocal Rank Fusion]
R2 --> RRF
end
subgraph Output
RRF --> MERGED[Merged and Re-ranked Results]
MERGED --> LLM[LLM Generation]
end
2. Reranking: The Two-Round Interview
The Analogy First
Round 1: Quick resume screen. You have 500 applicants. You run a fast filter—keyword match, basic qualifications. You narrow it to 50 candidates. Fast. Rough. You might have missed some gems; you might have let some duds through. But you've got a manageable shortlist.
Round 2: Detailed interview. For each of those 50, you actually sit down and talk to them. You ask questions. You see how they respond to your specific needs. This is slow—you can't interview 500 people. But for the 50 you did interview, you have a much better sense of who's actually right for the job.
Reranking = that two-round process. Round 1: vector search (or hybrid) retrieves 50–100 candidates. Fast. Round 2: a cross-encoder scores each (query, document) pair. Slow but accurate. You keep the top 5–10 and throw away the rest. Seems wasteful? It's not. It's the cheapest way to get the accuracy of a model that sees query and document together, without running that model over your entire corpus.
Aha Moment: Reranking seems wasteful—why retrieve 50 documents just to throw away 45? Because bi-encoders (used in vector search) encode query and document SEPARATELY. Cross-encoders (used in reranking) see them TOGETHER. It's like the difference between reading two resumes independently vs. interviewing someone and asking questions. The second is more accurate but too slow to do for every candidate.
The Technical Details
Bi-encoders (vector search): Query embedding computed once. Document embeddings precomputed at index time. Similarity = dot product of two vectors. The model never sees query and document together.
Cross-encoders: Concatenate query and document. Single forward pass. The model learns fine-grained interactions—which phrases directly address the query, which are tangential, which contradict. Much more accurate. But: 50 candidates = 50 forward passes per query. Hence the two-stage design.
Models: Cohere Rerank (~$1 per 1K queries). Jina Reranker v3 (0.6B params, state-of-the-art on BEIR). Open-source: cross-encoder/ms-marco-MiniLM-L-6-v2, cross-encoder/ms-marco-MiniLM-L-12-v2 (sentence-transformers). Self-host for zero per-query cost.
When to use it: High recall but noisy precision. Limited context window—you can only pass a few chunks. High-stakes applications (legal, medical). When to skip: Small corpus, already precise retrieval, or latency/cost paramount. Reranking adds 100–500ms. For high-volume chat, you might skip it.
Why this matters in production: When your context window is limited (e.g., you can only fit 5 chunks), those 5 had better be the right 5. Retrieving 50 and reranking to 5 is often the difference between a correct answer and a hallucinated one. The 100–500ms latency is usually acceptable for enterprise use cases where accuracy trumps speed.
Reranking Pipeline Diagram
flowchart TB
Query[User Query] --> Retrieve[Stage 1: Retrieve Many]
Retrieve --> CandidatePool[Candidate Pool: 50-100 docs]
CandidatePool --> CrossEncoder[Stage 2: Cross-Encoder Rerank]
CrossEncoder --> TopK[Top-K: 5-10 docs]
TopK --> LLM[LLM Generation]
3. Query Transformation: Making Your Query Smarter
HyDE: The Fake Answer Trick
The analogy: Instead of searching for your question, you write a FAKE answer first. Then you search for documents similar to that fake answer. Weirdly, this often works better.
Why? Your query lives in "question space"—short, abstract, vague. "What causes diabetes?" Documents live in "document space"—long, detailed, full of domain vocabulary like "insulin resistance," "pancreatic beta cells," "blood glucose regulation." The embedding of your question might not align well with document embeddings. But a hypothetical answer—generated by an LLM—uses vocabulary and structure similar to real documents. It sits closer to them in embedding space. So you embed the fake answer and search with that.
Flow: Query → LLM generates 3–5 hypothetical answers → embed each → average embeddings → search with averaged vector → retrieve real documents.
When it helps: Short queries, domain-specific, vocabulary gap between queries and docs. When it hurts: LLM generates wrong terminology, or latency/cost from the extra LLM call is prohibitive.
Aha Moment: HyDE sounds insane—generate a fake answer and search with THAT? But it works because the fake answer is in "document space" (long, detailed, uses domain vocabulary) while your query is in "question space" (short, vague). Documents are closer to other documents than they are to questions.
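The HyDE flow above, sketched with injected `generate_fn` and `embed_fn` placeholders (stand-ins for your LLM client and embedding model, not a specific library API):

```python
from typing import Callable, List

def hyde_query_embedding(
    query: str,
    generate_fn: Callable[[str], List[str]],  # prompt -> hypothetical answers
    embed_fn: Callable[[str], List[float]],   # text -> embedding vector
) -> List[float]:
    """HyDE: embed LLM-generated fake answers instead of the raw query,
    then average them into a single search vector."""
    prompt = f"Write a short passage that answers the question: {query}"
    hypotheticals = generate_fn(prompt)  # typically 3-5 generations
    vectors = [embed_fn(h) for h in hypotheticals]
    dim = len(vectors[0])
    # Element-wise mean of the hypothetical-answer embeddings
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
```

The returned vector goes straight into your existing vector index; nothing else in the pipeline changes.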
Multi-Query: Ask Five Different Ways
The analogy: You ask the same question 5 different ways and combine all the answers. "What are RAG limitations?" → "RAG failure modes," "when does RAG not work," "RAG disadvantages," "problems with retrieval-augmented generation." Run retrieval for each. Merge. Deduplicate. Different phrasings surface different documents. Recall goes up.
Trade-off: Multiple retrieval calls = cost. Some variations may retrieve junk. Best when recall matters more than precision. In practice, 3–5 variations is the sweet spot—diminishing returns after that, and the noise from irrelevant variations starts to hurt.
Step-Back: Broaden Before You Search
The analogy: Your query is too specific. "What is the boiling point of water at 5000 ft elevation?" Documents probably discuss "how altitude affects boiling point" in general terms. So you step back: "How does altitude affect boiling point?" Retrieve that. Use the broader context to answer the specific question. Fixes ~21.6% of RAG errors from over-specific queries.
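A minimal step-back sketch, assuming injected `llm_fn` (one LLM call returning the broadened query) and `retrieve_fn` (query to ranked doc list); both names are placeholders, not a specific library API:

```python
from typing import Callable, List

def step_back_retrieve(
    query: str,
    llm_fn: Callable[[str], str],
    retrieve_fn: Callable[[str], List[str]],
) -> List[str]:
    """Step-back: broaden an over-specific query before retrieval,
    then retrieve with BOTH the broad and the original query."""
    broad = llm_fn(
        "Rewrite this question as a more general question about the "
        f"underlying concept, in one sentence: {query}"
    )
    # Broad query surfaces conceptual docs; original query keeps specifics.
    docs = retrieve_fn(broad) + retrieve_fn(query)
    seen, merged = set(), []
    for d in docs:  # dedupe, preserving order
        if d not in seen:
            seen.add(d)
            merged.append(d)
    return merged
```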
Query Decomposition: Break It Down
Multi-hop questions need info from multiple documents. "Which company among NVIDIA, Apple, and Google made the biggest profit in 2023?" Decompose into: "NVIDIA profit 2023," "Apple profit 2023," "Google profit 2023." Retrieve for each. Merge. Deduplicate. Rerank. Research shows +36.7% MRR@10 and +11.6% F1 over standard RAG. No task-specific training. Drop-in enhancement.
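The decompose-retrieve-merge loop above, sketched with placeholder `llm_fn`, `retrieve_fn`, and optional `rerank_fn` callables (assumptions, not a specific library API):

```python
from typing import Callable, List, Optional

def decompose_and_retrieve(
    query: str,
    llm_fn: Callable[[str], str],
    retrieve_fn: Callable[[str], List[str]],
    rerank_fn: Optional[Callable[[str, List[str]], List[str]]] = None,
) -> List[str]:
    """Decompose a multi-hop question into sub-queries, retrieve for each,
    merge, dedupe, and optionally rerank against the ORIGINAL query."""
    raw = llm_fn(
        "Break this question into independent sub-questions, "
        f"one per line: {query}"
    )
    sub_queries = [line.strip() for line in raw.splitlines() if line.strip()]
    seen, merged = set(), []
    for sq in sub_queries:
        for doc in retrieve_fn(sq):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return rerank_fn(query, merged) if rerank_fn else merged
```

Note the rerank step scores against the original query, not the sub-queries, so the final top-k reflects the user's actual question.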
Query Transformation Comparison Diagram
flowchart TB
subgraph Original
Q1[Original Query] --> R1[Single Retrieval]
end
subgraph HyDE
Q2[Original Query] --> Gen2[LLM: Generate Fake Answer]
Gen2 --> Embed2[Embed Fake Answer]
Embed2 --> R2[Search with Fake Answer]
end
subgraph MultiQuery
Q3[Original Query] --> Gen3[LLM: 5 Query Variations]
Gen3 --> R3a[Retrieve 1]
Gen3 --> R3b[Retrieve 2]
Gen3 --> R3c[Retrieve 3]
R3a --> Merge3[Merge and Dedupe]
R3b --> Merge3
R3c --> Merge3
end
subgraph StepBack
Q4[Specific Query] --> Step4[LLM: Broaden Query]
Step4 --> R4[Retrieve with Broad Query]
end
4. Parent-Document Retrieval: Search by Sentence, Read the Full Page
The analogy: You're looking for a specific sentence in a book. You use the index (small chunks) to find it. But when you get there, you don't just read that one sentence—you read the full page. The sentence might reference "the policy" or "Section 3.2" without the surrounding context. The full page gives you definitions, preceding paragraphs, the big picture.
Technical: Index small chunks (128–400 tokens) for precise retrieval. When a chunk matches, return the parent document (1024–2048 tokens) that contains it. Store child→parent mapping. Embed only children at index time. At query time: retrieve top-k children, fetch their parents, pass parents to the LLM. Small chunks retrieve well; parents give context. Essential for legal docs, technical documentation, research papers.
Why this matters in production: In technical documentation, a chunk might say "see Section 3.2 for details" without including Section 3.2. The parent document has the full section. In legal contracts, definitions in one clause reference terms defined elsewhere in the same document. Without the parent, the LLM is guessing. Parent-document retrieval is one of the highest-impact, lowest-complexity improvements you can make.
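A minimal sketch of the child-to-parent mechanics; `search_children_fn` stands in for whatever vector index you use and is an assumption, not a specific API:

```python
from typing import Callable, Dict, List

class ParentDocumentRetriever:
    """Index small child chunks for precision; return parent docs for context."""

    def __init__(
        self,
        search_children_fn: Callable[[str, int], List[str]],  # query, k -> child IDs
        child_to_parent: Dict[str, str],                      # child ID -> parent ID
        parents: Dict[str, str],                              # parent ID -> full text
    ):
        self.search_children = search_children_fn
        self.child_to_parent = child_to_parent
        self.parents = parents

    def retrieve(self, query: str, top_k: int = 5) -> List[str]:
        child_ids = self.search_children(query, top_k)
        seen, results = set(), []
        for cid in child_ids:
            pid = self.child_to_parent[cid]
            if pid not in seen:  # two matching children may share a parent
                seen.add(pid)
                results.append(self.parents[pid])
        return results
```

Only the children are embedded at index time; the parent store can be a plain key-value lookup.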
5. Contextual Compression: Cut the Noise
After retrieval, you have chunks. Some contain irrelevant sentences. Contextual compression: for each chunk, send it + query to an LLM. "Extract only the sentences relevant to answering this question." Get condensed versions. Pass those to the main LLM. LangChain's LLMChainExtractor does this. Trade-off: Extra LLM call per chunk. Latency. Cost. Can over-compress—might drop nuanced context. Use when chunks are long and noisy; skip when they're already focused.
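A sketch of per-chunk compression with an injected `llm_fn` placeholder for your LLM client (LangChain's LLMChainExtractor wraps the same idea):

```python
from typing import Callable, List

def compress_chunk(query: str, chunk: str, llm_fn: Callable[[str], str]) -> str:
    """Ask an LLM to keep only the sentences that help answer the query."""
    prompt = (
        "Extract only the sentences relevant to answering this question. "
        "If nothing is relevant, return an empty string.\n"
        f"Question: {query}\nText: {chunk}"
    )
    return llm_fn(prompt).strip()

def compress_all(query: str, chunks: List[str], llm_fn: Callable[[str], str]) -> List[str]:
    """Compress each retrieved chunk; drop chunks with nothing relevant."""
    compressed = (compress_chunk(query, c, llm_fn) for c in chunks)
    return [c for c in compressed if c]
```

One LLM call per chunk is the cost; batching or a small local model keeps latency in check.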
6. Agentic RAG: The Researcher, Not the Vending Machine
The Analogy First
Naive RAG = vending machine. Put in a query, get a response. Same pipeline every time. No decisions. No iteration.
Agentic RAG = researcher. The agent DECIDES: Do I need to search? What should I search for? Are these results good enough? Should I try again with a different query? Maybe I need two separate searches for two different topics. Maybe this question doesn't need retrieval at all—"What is 2+2?" The agent routes, decomposes, evaluates, and iterates.
Aha Moment: Agentic RAG is where RAG meets agents. The agent can say "these results suck, let me rephrase my query" or "I need information from two different topics, let me do two separate searches." This is the future of RAG.
The Technical Details
The agent has a retrieval tool. It can call it multiple times. It can decompose "Compare refund policies of Company A and B" into "Company A refund policy" and "Company B refund policy." It can evaluate: "Do I have enough?" If not, refine and retrieve again. LangGraph is the go-to framework: define nodes (route, retrieve, generate, evaluate), conditional edges, loops back to retrieval.
When to use: Variable query complexity. Multi-hop questions. Unreliable retrieval. Exploratory tasks. When to skip: High-volume, simple Q&A. Latency and cost matter. Agentic RAG adds round-trips.
Why this matters in production: Customer support bots get "What's your refund policy?" (simple) and "I bought item X from store Y last month, it arrived broken, and I want a refund—what are my options?" (complex, multi-hop). A fixed pipeline treats both the same. An agentic system routes the first to a quick retrieval, the second to decomposition and multiple retrievals. You save cost on simple queries and get accuracy on hard ones.
Agentic RAG Decision Flow
flowchart TB
Start([User Query]) --> Route{Need Retrieval?}
Route -->|No| GenDirect[Generate from Parametric Knowledge]
Route -->|Yes| Retrieve[Retrieve Documents]
Retrieve --> Eval{Quality Sufficient?}
Eval -->|Yes| Gen[Generate Answer]
Eval -->|No| Refine[Refine Query or Decompose]
Refine --> Retrieve
Gen --> Critique{Answer Supported?}
Critique -->|Yes| ReturnAnswer([Return Answer])
Critique -->|No| Retrieve
GenDirect --> ReturnAnswer
7. CRAG: The Quality Control Inspector
The analogy: CRAG = RAG with a quality control inspector. Before you pass retrieved docs to the LLM, the inspector checks: "Are these actually relevant?" Three possible verdicts:
- Correct: Proceed. Use the docs. Optionally decompose and recompose.
- Ambiguous: Some relevance but incomplete. Refine the query, re-retrieve, or supplement with web search.
- Incorrect: These docs are garbage. Discard. Fall back to web search or say "I don't have sufficient information."
Naive RAG blindly passes whatever is retrieved. Bad retrieval → LLM hallucinates. CRAG adds a gate. The evaluator can be a small classifier or lightweight LLM call. One extra inference per query. Worth it when your corpus is incomplete or queries often fall outside the knowledge base.
Why this matters in production: Internal knowledge bases are never complete. Users ask about products that launched yesterday, policies that changed last week, edge cases that were never documented. Without CRAG, the system retrieves the "closest" docs (which might be irrelevant) and the LLM confidently hallucinates. With CRAG, the system says "I don't have sufficient information" or falls back to web search. Trust > confidence.
CRAG Flow Diagram
flowchart TB
Q[Query] --> Retrieve[Initial Retrieval]
Retrieve --> Eval[Retrieval Quality Evaluator]
Eval --> Classify{Classification}
Classify -->|Correct| Decompose[Decompose and Recompose]
Classify -->|Ambiguous| Refine[Refine Query]
Classify -->|Incorrect| Fallback[Web Search or Alternative]
Decompose --> Gen[Generate Answer]
Refine --> Retrieve
Fallback --> Gen
Gen --> Out([Final Answer])
8. GraphRAG: Connecting the Dots
The analogy: Vector search finds individual facts. Graph search finds RELATIONSHIPS between facts. "Person A reports to Person B." "Department X conflicts with Department Y." "Event 1 caused Event 2." Vector similarity might retrieve chunks that mention these entities. But the connections—who reports to whom, what caused what—live in the graph structure. GraphRAG extracts entities and relationships, builds a knowledge graph, and uses graph traversal alongside vector search.
Microsoft's GraphRAG: Extracts entities, relationships, claims. Builds hierarchical graph with Leiden clustering. Generates community-level summaries. Beats vector-only RAG on holistic understanding (summarizing across large collections) and connection synthesis (linking disparate info). "What are the main conflicts in this organization?"—the graph reveals structure that vectors miss.
When to use: Rich relational structure. Multi-hop reasoning. Entity-centric domains (orgs, people, events). When to skip: Single-hop questions. Independent documents. Extraction adds complexity and can introduce errors.
Why this matters in production: Enterprise knowledge bases are full of relationships—org charts, project dependencies, vendor relationships, incident causality. "What caused the outage?" isn't just "find chunks about outages"—it's "trace the chain of events." GraphRAG excels at these. The indexing cost is higher (entity extraction, relation extraction, graph construction), but for the right domain, the payoff is substantial.
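The traversal half of GraphRAG can be sketched with a plain adjacency dict and BFS. Production systems use a graph database and an LLM-extracted entity graph; `expand_entities`, the example entities, and the hop limit here are illustrative:

```python
from collections import deque
from typing import Dict, List, Set

def expand_entities(
    graph: Dict[str, List[str]],  # entity -> related entities (from extraction)
    seeds: List[str],             # entities detected in the user query
    max_hops: int = 2,
) -> Set[str]:
    """BFS outward from query entities to pull in related entities,
    whose associated chunks are then added to the retrieval context."""
    frontier = deque((s, 0) for s in seeds)
    visited = set(seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # hop budget spent; don't expand further
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return visited
```

This is the "trace the chain of events" step: the expanded entity set selects chunks that pure vector similarity would never surface.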
9. Self-RAG, RAPTOR, Multi-Modal: The Rest of the Toolkit
Self-RAG: The model itself (trained with special tokens) decides: retrieve? are passages relevant? is my answer supported? Reflection tokens trigger retrieval, critique, retry. End-to-end adaptable. Requires training—not plug-and-play.
RAPTOR: Hierarchical tree of summaries. Cluster chunks → summarize clusters → repeat. Retrieve at multiple abstraction levels. Great for "What are the main themes?"—high-level summaries may have the answer directly. Trade-off: expensive indexing. RAPTOR improves QuALITY benchmark by 20% absolute and QASPER F1 from 36% to 56%—significant gains for synthesis-heavy questions.
Multi-Modal RAG: Text + images + tables. OCR, captioning, CLIP, or multimodal LLMs. Tables in PDFs are hard—layout models help. VDocRAG (2025), RAG-Anything (2025) push the frontier.
10. Evaluation: Did Any of This Actually Help?
Metrics: Precision@k (fraction of top-k that are relevant). Recall@k (fraction of all relevant in top-k). MRR (1/rank of first relevant). NDCG (relevance + position). You need relevance judgments—human or LLM-as-judge.
LLM-as-judge: "On a scale of 1–5, how relevant is this document?" Scales better than humans. May introduce bias. UDCG (2025): Utility and distraction-aware metric; improves correlation with answer accuracy by up to 36% over traditional metrics.
Build an eval set: Collect queries. Retrieve. Label (query, doc) pairs. Compute metrics. Iterate.
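The metrics above reduce to a few lines each; a sketch of precision@k, recall@k, and MRR over labeled (query, doc) pairs:

```python
from typing import List, Set

def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved docs that are relevant."""
    top = retrieved[:k]
    return sum(1 for d in top if d in relevant) / k

def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of ALL relevant docs that appear in the top-k."""
    top = retrieved[:k]
    return sum(1 for d in top if d in relevant) / len(relevant)

def mrr(retrieved: List[str], relevant: Set[str]) -> float:
    """Reciprocal rank of the first relevant doc (0 if none retrieved)."""
    for i, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / i
    return 0.0
```

Average each metric over your eval queries; a single query's score is noise.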
Why this matters in production: You can't improve what you don't measure. Every RAG system in production should have an eval set—even 50 queries with relevance labels. Log retrieval results, compute precision@k weekly, track drift. When a new embedding model or chunking strategy ships, run the eval before and after. Evaluation is the difference between "we think it's better" and "we measured it's better."
Interview Insight: When they ask "your RAG returns 30% irrelevant results," they want a DIAGNOSTIC FRAMEWORK, not a single fix. Show you can think systematically.
Code Examples
Hybrid Search with BM25 + Vector Search + RRF Fusion
from rank_bm25 import BM25Okapi
import numpy as np
from typing import List, Tuple

def reciprocal_rank_fusion(
    ranked_lists: List[List[str]],
    k: int = 60
) -> List[Tuple[str, float]]:
    """Merge multiple ranked lists using RRF. Each list is [doc_id, doc_id, ...]"""
    scores = {}
    for doc_list in ranked_lists:
        for rank, doc_id in enumerate(doc_list):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    sorted_docs = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return sorted_docs

def hybrid_search(
    query: str,
    documents: List[str],
    doc_ids: List[str],
    embedding_fn,
    vector_index,
    tokenizer_fn,
    top_k: int = 10,
    rrf_k: int = 60
) -> List[str]:
    # BM25 retrieval
    tokenized_docs = [tokenizer_fn(d) for d in documents]
    bm25 = BM25Okapi(tokenized_docs)
    tokenized_query = tokenizer_fn(query)
    bm25_scores = bm25.get_scores(tokenized_query)
    bm25_top_indices = np.argsort(bm25_scores)[::-1][:top_k * 2]  # Get more for fusion
    bm25_doc_ids = [doc_ids[i] for i in bm25_top_indices if bm25_scores[i] > 0]
    # Vector retrieval
    query_embedding = embedding_fn(query)
    vector_results = vector_index.search(query_embedding, top_k=top_k * 2)
    vector_doc_ids = [r.id for r in vector_results]
    # RRF fusion
    fused = reciprocal_rank_fusion([bm25_doc_ids, vector_doc_ids], k=rrf_k)
    return [doc_id for doc_id, _ in fused[:top_k]]
Reranking with Cohere or Cross-Encoder
# Using Cohere Rerank API
from cohere import Client

def rerank_cohere(query: str, documents: List[str], top_n: int = 5) -> List[str]:
    cohere = Client(api_key="YOUR_API_KEY")
    response = cohere.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=documents,
        top_n=top_n
    )
    return [documents[r.index] for r in response.results]

# Using open-source cross-encoder (sentence-transformers)
from sentence_transformers import CrossEncoder

def rerank_cross_encoder(query: str, documents: List[str], top_n: int = 5) -> List[str]:
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    pairs = [(query, doc) for doc in documents]
    scores = model.predict(pairs)
    ranked_indices = np.argsort(scores)[::-1][:top_n]
    return [documents[i] for i in ranked_indices]
Multi-Query Retrieval with LangChain
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma

def setup_multi_query_retriever(vectorstore, llm, k: int = 4):
    base_retriever = vectorstore.as_retriever(search_kwargs={"k": k})
    multi_retriever = MultiQueryRetriever.from_llm(
        retriever=base_retriever,
        llm=llm
    )
    return multi_retriever

# Usage: multi_retriever invokes the LLM to generate query variations,
# retrieves for each, deduplicates, and returns merged results.
Agentic RAG with LangGraph (Simplified)
from langgraph.graph import StateGraph, END
from typing import TypedDict, List, Literal

class RAGState(TypedDict):
    query: str
    documents: List[str]
    answer: str
    retrieval_count: int
    need_more: bool

def route_query(state: RAGState) -> Literal["retrieve", "generate"]:
    """Agent decides: does this query need retrieval?"""
    if "?" in state["query"] or any(w in state["query"].lower() for w in ["policy", "how", "what", "when"]):
        return "retrieve"
    return "generate"

def retrieve(state: RAGState) -> RAGState:
    docs = retriever.invoke(state["query"])
    return {**state, "documents": docs, "retrieval_count": state.get("retrieval_count", 0) + 1}

def generate(state: RAGState) -> RAGState:
    context = "\n\n".join(state["documents"]) if state["documents"] else ""
    answer = llm.invoke(f"Context:\n{context}\n\nQuestion: {state['query']}")
    return {**state, "answer": answer.content}

def should_retrieve_more(state: RAGState) -> Literal["retrieve", "end"]:
    # Cap at two retrieval rounds; otherwise loop back only if the agent flagged need_more.
    if state["retrieval_count"] >= 2 or not state.get("need_more"):
        return "end"
    return "retrieve"

graph = StateGraph(RAGState)
graph.add_node("retrieve", retrieve)
graph.add_node("generate", generate)
graph.add_conditional_edges("__start__", route_query, {"retrieve": "retrieve", "generate": "generate"})
graph.add_edge("retrieve", "generate")
graph.add_conditional_edges("generate", should_retrieve_more, {"retrieve": "retrieve", "end": END})
app = graph.compile()
Simple CRAG-Style Retrieval Quality Check
def evaluate_retrieval_quality(query: str, documents: List[str], llm) -> str:
    """Returns 'correct', 'ambiguous', or 'incorrect'."""
    context = "\n\n".join(documents[:5])
    prompt = f"""Evaluate whether these retrieved documents are relevant to the query.
Query: {query}
Documents:
{context}
Respond with exactly one word: correct, ambiguous, or incorrect.
- correct: Documents clearly contain information to answer the query.
- ambiguous: Some relevance but incomplete or uncertain.
- incorrect: Documents are not relevant to the query.
"""
    response = llm.invoke(prompt)
    label = response.content.strip().lower()
    if label not in ("correct", "ambiguous", "incorrect"):
        return "ambiguous"
    return label

def crag_retrieve(query: str, retriever, llm, web_search_fn=None):
    docs = retriever.invoke(query)
    quality = evaluate_retrieval_quality(query, docs, llm)
    if quality == "correct":
        return docs
    elif quality == "ambiguous":
        refined = llm.invoke(f"Rewrite this query for better retrieval: {query}")
        return retriever.invoke(refined.content)
    else:
        if web_search_fn:
            return web_search_fn(query)
        return []
Conversational Interview Q&A: Weak vs Strong Answers
Q1: "Your RAG returns 30% irrelevant chunks. How do you fix it?"
Weak answer: "I'd add reranking." (Single fix. No diagnosis. Interviewer thinks you're guessing.)
Strong answer: "First, I'd diagnose where the failure is. Build an eval set of 50–100 queries with ground-truth relevant docs. Log retrieved chunks and relevance. Is it retrieval (wrong chunks) or generation (right chunks, wrong use)? If retrieval: inspect failure modes. Missing exact matches (IDs, codes)? Add hybrid search with BM25. Missing semantic matches? Try HyDE or multi-query. Too much noise? Add reranking—retrieve 50, rerank to top 5. Queries too specific? Step-back prompting. Then re-evaluate. Measure precision@k, recall@k, answer accuracy. Iterate. The key is data-driven diagnosis, not throwing techniques at the wall."
Q2: "Explain hybrid search. When does BM25 beat vector search?"
Weak answer: "Hybrid combines BM25 and vectors." (No when or why.)
Strong answer: "Hybrid runs both sparse (BM25) and dense (vector) retrieval in parallel, then fuses with RRF. BM25 beats vector when exact term matching matters: product IDs like SKU-7842, error codes like ERR_404, proper nouns, technical terminology. Embedding models handle these poorly: rare strings get split into subword tokens that carry little semantic signal. Vector wins for paraphrasing, synonyms, conceptual questions. Best approach: run both, fuse, let each compensate for the other's weaknesses."
Q3: "Why rerank instead of using a better embedding model?"
Weak answer: "Reranking is more accurate." (Doesn't address the trade-off.)
Strong answer: "A better bi-encoder still encodes query and documents separately. The model never sees them together. Cross-encoders do—they concatenate query and doc, one forward pass. That's why they're more accurate. But you can't run a cross-encoder on millions of docs at index time. Reranking is the compromise: cheap bi-encoder narrows to 50 candidates, expensive cross-encoder scores only those 50. You get accuracy where it matters—the top of the list—without encoding the whole corpus. Typically 10–25% accuracy gain for 100–500ms latency."
Q4: "What's agentic RAG? When would you use it?"
Weak answer: "It's RAG with an agent." (Vague.)
Strong answer: "Agentic RAG replaces the fixed pipeline with an agent that decides when to retrieve, what to retrieve, and whether to iterate. The agent has a retrieval tool, can call it multiple times, can decompose queries, evaluate quality, retry. Use it when: variable query complexity (some need no retrieval, some need multi-hop), multi-hop questions, unreliable retrieval, exploratory tasks. Skip it when: high-volume simple Q&A, latency and cost matter. Agentic adds round-trips. LangGraph is the standard framework."
Q5: "Design RAG for 10M legal documents. Walk through your choices."
Weak answer: "I'd use vector search and maybe hybrid." (Surface-level.)
Strong answer: "Chunking: semantic chunking that respects section boundaries. 400–600 tokens. Parent-document retrieval—index small chunks, return parent sections (1500–2000 tokens) for context. Retrieval: hybrid search essential—legal queries use exact terms (Section 12.3, case names). BM25 + vector, RRF fusion. Reranking: cross-encoder, retrieve 50, rerank to top 5. Legal accuracy is critical. Query transformation: step-back for over-specific queries, multi-query for recall. Indexing: scalable vector DB with filtering by jurisdiction, type, date. Sharding. Incremental indexing. Evaluation: eval set from real queries + expert relevance judgments. Precision@5, recall@10, MRR, citation correctness. Wrong citations are unacceptable in legal."
Q6: "What's GraphRAG? When do you use knowledge graphs with RAG?"
Weak answer: "GraphRAG uses knowledge graphs." (No when.)
Strong answer: "GraphRAG extracts entities and relationships, builds a graph, uses graph traversal alongside vector search. Vector finds similar text; graph finds structure—who reports to whom, what caused what. Use it when: relational queries, entity-centric domains (orgs, people, events), holistic summarization, connection synthesis across documents. Skip when: single-hop questions, independent docs, or extraction would introduce errors. Microsoft's GraphRAG uses Leiden clustering and community summaries. Adds complexity but beats vector-only on holistic and connection tasks."
Q7: "How do you handle a corpus that changes daily? Real-time docs?"
Weak answer: "I'd reindex." (No nuance.)
Strong answer: "Daily updates: incremental indexing. Upsert new docs to vector store. For BM25, update inverted index—some systems support incremental, others need periodic rebuilds. Nightly job: fetch new/changed docs, process, upsert. Optional 'last updated' filter. Real-time: tiered approach. Hot path—small, frequently updated index (e.g., last 24 hours), fast upserts. Cold path—batch-updated historical index. Queries search both, merge. Or use Elasticsearch/OpenSearch with 1-second refresh. Trade-off: real-time adds complexity. Most apps, hourly or daily batch is enough."
Q8: "What's CRAG? How does it improve robustness?"
Weak answer: "CRAG evaluates retrieval quality." (No branching logic.)
Strong answer: "CRAG adds a retrieval quality evaluator before generation. Lightweight classifier or LLM returns: Correct, Ambiguous, or Incorrect. Correct: proceed with standard RAG, optionally decompose. Ambiguous: refine query, re-retrieve, or supplement with web search. Incorrect: discard docs, fall back to web search or 'I don't have sufficient information.' Naive RAG blindly passes bad retrieval to the LLM—hallucination city. CRAG gates: poor retrieval triggers a different path. One extra inference per query. Worth it when corpus is incomplete or queries often fall outside the knowledge base."
Quick Fire Round
Flashcard-style. Answer out loud before peeking.
Q: What does RRF stand for and why do we use it?
A: Reciprocal Rank Fusion. Merges ranked lists from different retrievers without score normalization—uses rank positions only. No tuning.
Q: Bi-encoder vs cross-encoder—what's the key difference?
A: Bi-encoder encodes query and doc separately; cross-encoder sees them together in one forward pass. Cross-encoder more accurate but too slow for full corpus.
Q: HyDE in one sentence.
A: Generate a hypothetical answer to the query, embed it, search with that—bridges query-document vocabulary gap.
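HyDE really is that short once you have a generator, an embedder, and an index. A sketch with all three as hypothetical callables:

```python
def hyde_search(query, llm, embed, index_search):
    """HyDE sketch. `llm`, `embed`, and `index_search` are hypothetical
    stand-ins for your generator, embedding model, and vector index.
    The trick: embed a hypothetical *answer*, not the query itself,
    so the search vector lives in document-space, not question-space."""
    hypothetical = llm(f"Write a short passage that answers: {query}")
    return index_search(embed(hypothetical))
```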
Q: When does BM25 beat vector search?
A: Exact term matching—IDs, codes, proper nouns, technical terms. Embedding models handle rare or out-of-vocabulary strings poorly.
Q: What's the typical reranking setup?
A: Retrieve 50–100 with bi-encoder, rerank to top 5–10 with cross-encoder.
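That two-stage setup as a sketch—`bi_search` and `cross_score` are hypothetical stand-ins for your vector index and cross-encoder:

```python
def two_stage_retrieve(query, bi_search, cross_score, fetch_k=50, top_k=5):
    """Stage 1: `bi_search(query, k)` pulls candidates from the vector
    index (cheap, high recall). Stage 2: `cross_score(query, doc)` scores
    each pair jointly with a cross-encoder (accurate, slow)—affordable
    because it only runs over fetch_k candidates, not the corpus."""
    candidates = bi_search(query, fetch_k)
    scored = [(cross_score(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]
```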
Q: Step-back prompting fixes what?
A: Over-specific queries. Rewrite to a broader question before retrieval; fixes ~21% of such errors.
Q: Parent-document retrieval: index what, return what?
A: Index small chunks (precise retrieval). Return parent document (full context).
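The child→parent bookkeeping in miniature—the chunker and scorer here are toy stand-ins, and the "index" is a plain list:

```python
def build_parent_index(parents, chunker):
    """Index small chunks; keep a child->parent mapping alongside."""
    chunk_texts, chunk_to_parent = [], []
    for pid, doc in enumerate(parents):
        for chunk in chunker(doc):
            chunk_texts.append(chunk)
            chunk_to_parent.append(pid)
    return chunk_texts, chunk_to_parent

def retrieve_parents(query, chunk_texts, chunk_to_parent, parents,
                     score, top_k=2):
    """Retrieve by chunk, return the parent. Dedupe parents, since
    several top-ranked chunks may share one. `score` is a stand-in
    for real chunk-level similarity."""
    ranked = sorted(range(len(chunk_texts)),
                    key=lambda i: score(query, chunk_texts[i]), reverse=True)
    seen, out = set(), []
    for i in ranked:
        pid = chunk_to_parent[i]
        if pid not in seen:
            seen.add(pid)
            out.append(parents[pid])
        if len(out) == top_k:
            break
    return out
```

The dedupe step matters: without it, one strong parent can fill every slot with near-duplicate context.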
Q: Agentic RAG vs naive RAG—one key difference.
A: Agent decides when/what to retrieve and whether to iterate. Naive RAG: fixed pipeline every time.
Q: CRAG's three branches?
A: Correct (proceed), Ambiguous (refine/re-retrieve), Incorrect (fallback/web search).
Q: GraphRAG beats vector when?
A: Relational queries, multi-hop reasoning, entity-centric domains, connection synthesis.
Q: Multi-query: what's the trade-off?
A: Higher recall (different phrasings surface different docs). Cost: multiple retrieval calls, potential noise.
Q: Query decomposition helps with what?
A: Multi-hop questions. Break into sub-queries, retrieve for each, merge. +36% MRR, +11% F1.
Q: What's the typical α split for linear combination hybrid search?
A: 0.7 vector, 0.3 keyword. Tune on eval set—ID-heavy domains may need 0.5/0.5.
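Linear combination needs normalization first, since cosine similarities and BM25 scores live on different scales. A min-max sketch (score dicts map doc ID to raw score from each retriever):

```python
def hybrid_score(vec_scores, kw_scores, alpha=0.7):
    """Blend vector and keyword scores: min-max normalize each set to
    [0, 1], then weight the vector side by alpha. Docs missing from one
    retriever contribute 0 on that side."""
    def norm(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # guard against all-equal scores
        return {doc: (s - lo) / span for doc, s in scores.items()}
    v, k = norm(vec_scores), norm(kw_scores)
    return {doc: alpha * v.get(doc, 0.0) + (1 - alpha) * k.get(doc, 0.0)
            for doc in v.keys() | k.keys()}
```

This is exactly the tuning burden RRF avoids: min-max is sensitive to outlier scores, and α itself needs an eval set.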
Q: RAPTOR improves what kind of questions?
A: Synthesis questions—"What are the main themes?" High-level summaries integrate info across chunks.
Q: UDCG improves on traditional metrics how?
A: Uses LLM-oriented positional discount; quantifies both positive contributions and distracting docs. Up to 36% better correlation with end-to-end answer accuracy.
From Your Experience
These prompts are for reflection, drawing on your experience building AI email booking systems and AI platforms at Maersk:
Did you implement any advanced RAG patterns in the email system? Hybrid search? Reranking?
Consider whether the email system used retrieval at all—e.g., for matching incoming emails to templates, policies, or past conversations. If so, did you use vector search only, or did you combine with keyword matching (e.g., for booking references, order IDs)? Did you add a reranking step to improve precision before passing context to the LLM?
How did you handle retrieval quality issues? What was your debugging process?
When retrieval returned irrelevant results, how did you diagnose the cause? Did you log queries, retrieved chunks, and outcomes? Did you build an evaluation set? Did you try query transformation (e.g., extracting key terms from emails before retrieval)? How did you balance recall (finding the right email or policy) with precision (avoiding noise)?
Did you use agentic RAG—where the agent decides whether to retrieve more?
In the AI platform, did the system ever need to make retrieval decisions dynamically? For example, if a user's question was ambiguous, did the agent ask clarifying questions or retrieve from multiple sources? Did the agent evaluate whether the first retrieval was sufficient before generating a response? If not, could agentic RAG have helped in any scenarios?
Key Takeaways (Cheat Sheet)
| Topic | Key Point |
|---|---|
| Hybrid Search | BM25 + vector search. RRF: score = Σ 1/(k+rank), no tuning. Linear combo: 0.7 vector + 0.3 keyword typical. Use when exact terms (IDs, names) matter. |
| Reranking | Two-stage: retrieve 50 with bi-encoder, rerank to top 5 with cross-encoder. Cross-encoders see query+doc together → more accurate. Cohere, Jina, ms-marco. +10–25% accuracy, +100–500ms latency. |
| HyDE | Generate hypothetical answer → embed it → search. Bridges query-document vocabulary gap. Helps when queries are short/abstract. |
| Multi-Query | Generate query variations → retrieve for each → merge, dedupe. Increases recall. Cost: multiple retrievals. |
| Step-Back | Rewrite specific query to broader one before retrieval. "Boiling point at 5000ft" → "How does altitude affect boiling point?" Fixes ~21% of over-specific errors. |
| Query Decomposition | Break multi-hop question into sub-queries → retrieve for each → merge. +36% MRR, +11% F1 on multi-hop. |
| Parent-Document | Index small chunks, return parent. Small chunks retrieve well; parent gives context. Store child→parent mapping. |
| Contextual Compression | LLM extracts relevant sentences from each chunk post-retrieval. Reduces noise. Adds latency/cost. |
| Agentic RAG | Agent decides: retrieve? what query? retry? Use for variable complexity, multi-hop, unreliable retrieval. LangGraph. |
| Self-RAG | Model has reflection tokens: retrieve? relevant? supported? Self-critique loop. Requires training. |
| CRAG | Evaluate retrieval: correct → use; ambiguous → refine; incorrect → web search/fallback. Adds robustness. |
| RAPTOR | Hierarchical summarization: cluster chunks → summarize → repeat. Retrieve at multiple abstraction levels. Good for synthesis. |
| GraphRAG | Entities + relationships → knowledge graph. Graph traversal + vector. Best for relational, multi-hop, holistic questions. |
| Multi-Modal RAG | Text + images + tables. OCR, captioning, CLIP, or multimodal LLMs. Tables in PDFs are hard; layout models help. |
| Evaluation | Precision@k, Recall@k, MRR, NDCG. Build labeled (query, doc) pairs. LLM-as-judge scales but may bias. UDCG (2025) improves correlation with answer accuracy. |
Further Reading (Optional)
- Corrective Retrieval Augmented Generation (Yan et al., 2024) — CRAG method
- Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection (Asai et al., ICLR 2024)
- RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval (Sarthi et al., ICLR 2024)
- Microsoft GraphRAG — Knowledge graph RAG
- LangGraph Agentic RAG Documentation
- Cohere Rerank API
- Reciprocal Rank Fusion - Elasticsearch