AI Engineer Prep

Session 3: RAG Fundamentals — The One Topic That Lands You the Job

Here's the thing: RAG appears in all six of your target job descriptions. Not "some." All of them. If you're going to nail one session, make it this one. By the end, you'll think about RAG differently—not as a buzzword, but as a system you can design on a whiteboard while an interviewer watches.

Let's get into it.


Start With the Analogy (You'll Thank Me in the Interview)

Before we define anything, picture this: RAG is giving the model an open-book exam instead of testing its memory.

When you cram for a closed-book test, you're limited to what you memorized. Miss one fact? You're sunk. LLMs have the same problem—they only "know" what was in their training data, and that knowledge has a cutoff date. Ask GPT-4 about something that happened last week? It might hallucinate something plausible but wrong.

RAG flips the script. Instead of relying on the model's memory, we hand it a stack of relevant documents at query time. "Here's the context. Answer from this." The model doesn't need to have seen the information before—it just needs to read what we give it and respond. That's the entire idea.

Now let's build the rest of the picture with analogies. These will stick in your head when you're under pressure.


The RAG Mental Model: Six Analogies That Explain Everything

Chunking = Cutting a book into index cards. Too small and you lose context—"refunds are processed within" tells you nothing without "5–7 business days." Too big and you can't find what you need—a 2000-word chunk about "company policies" mixes refunds, vacation, and parking, so your query about refunds retrieves noise. The art is finding the right card size.

Embeddings = GPS coordinates for meaning. The word "king" and "queen" live in the same neighborhood—they're semantically close. "King" and "bicycle" are across town. Embeddings turn text into vectors in a high-dimensional space where similar meanings cluster together. That's how we find "relevant" without keyword matching.

Vector store = A library where books are shelved by meaning instead of alphabetically. In a normal library, "Refund Policy" and "Return Guidelines" might be on different floors. In a vector store, they're next to each other because they mean similar things. The shelving system is the embedding space.

Similarity search = "Find me books shelved near THIS spot." You hand the librarian a query (embedded), and they return the chunks whose vectors are closest to yours. That's retrieval.

Metadata filtering = Asking the librarian: "Find books about refunds, but only from the HR section, published after 2023." You narrow the search before or after the similarity sweep. Pre-filter: search only in the HR section. Post-filter: get top results, then throw out anything that doesn't match. Both work; pre-filter is usually better when you have strict constraints.

The indexing pipeline = Building the library. One-time (or periodic) heavy lift. Load documents, chunk them, embed them, store them. This happens offline. Nobody's waiting.

The retrieval pipeline = Using the library. User asks a question. You embed the query, search the vector store, get top-k chunks, maybe rerank them, then pass them to the LLM. This happens every single query. Latency matters.

Got it? Good. Now let's go deeper—and make sure you can explain every piece of this in an interview.


Why RAG Dominates Enterprise AI (And Your Interview)

RAG wasn't invented yesterday. Lewis et al. published the original paper in 2020 at NeurIPS: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. The key insight: instead of stuffing all knowledge into the model's weights, store it in an external index and retrieve only what's relevant for each query.

Today, RAG is the default pattern for enterprise AI. Why?

  1. Knowledge stays fresh — Documents change? Re-index. The model stays the same. No retraining.
  2. Citations and traceability — Users can verify answers against source documents. "Where did you get that?" → "Section 3.2 of the HR handbook."
  3. Fewer hallucinations — You constrain the model to answer from context. It can still hallucinate (we'll get to that), but you've stacked the deck in your favor.
  4. Proprietary knowledge — Your internal docs, customer data, confidential policies—none of that was in the model's training. RAG makes it usable.
  5. Cost-effective — No fine-tuning compute. No continuous retraining. Just retrieval + generation.

RAG vs. fine-tuning: Use RAG when your knowledge is large, frequently updated, or proprietary; when you need citations; when you want to ground answers in external data. Use fine-tuning when you need consistent custom behavior—tone, format, domain terminology—that's hard to achieve through prompting alone. Fine-tuning teaches the model how to behave; RAG gives it what to say. In production, you often combine both.


Interview Insight

RAG is the #1 topic in AI engineer interviews in 2026. If you can only prepare for one thing, prepare for RAG. They will ask you to design a RAG system on the whiteboard.


The Three-Stage RAG Pipeline — Index, Retrieve, Generate

A RAG system has three distinct pipelines. Get this diagram in your head—you'll draw it in interviews.

flowchart LR
    subgraph Index["Indexing Pipeline"]
        direction TB
        docs[Documents]
        load[Document Loaders]
        extract[Text Extraction]
        chunk[Chunking]
        embed[Embedding Model]
        store[(Vector Store)]
        docs --> load --> extract --> chunk --> embed --> store
    end
    
    subgraph Retrieve["Retrieval Pipeline"]
        direction TB
        query[User Query]
        qEmbed[Query Embedding]
        search[Similarity Search]
        rerank[Rerank Optional]
        topK[Top-K Chunks]
        query --> qEmbed --> search --> rerank --> topK
    end
    
    subgraph Gen["Generation Pipeline"]
        direction TB
        prompt[Prompt Context plus Query]
        llm[LLM]
        answer[Answer]
        topK --> prompt --> llm --> answer
    end
    
    store -.->|Vector DB| search

Indexing (offline): Load docs → extract text → chunk → embed → store in vector DB. Runs periodically or on-demand. Decoupled from user queries.

Retrieval (online): User asks → embed query → similarity search → optional rerank → return top-k chunks. Latency is critical. Vector DBs use ANN (approximate nearest neighbor) to trade a bit of accuracy for speed.

Generation (online): Combine retrieved chunks + query into a prompt → send to LLM → get answer. The model is instructed to cite sources, avoid speculation, and say "I don't know" when context is insufficient.

Why this matters in production: At Maersk, when you built the email booking system, the indexing pipeline might have run when new carrier policies or rate tables were added. The retrieval pipeline ran every time a user asked about a booking—or every time the system needed to ground extraction in known formats. The generation pipeline produced the final response. Understanding these as separate stages lets you optimize each one independently—and explain trade-offs in an interview. In a customer support chatbot, indexing runs nightly when new FAQs are published; retrieval runs on every user message; generation produces the reply. Each stage has different latency requirements, cost profiles, and failure modes. Treat them as distinct systems.


Chunking: Where Most RAG Systems Go Wrong

Chunking is one of the most consequential design choices in RAG. Get it wrong, and retrieval is garbage. Get it right, and everything else gets easier.

The Chunking Comparison — Same Document, Three Strategies

flowchart TB
    subgraph Doc["Same Document: Refund Policy"]
        doc[Our refund policy allows returns within 30 days. You must provide a valid receipt. Refunds are processed within 5-7 business days. For international orders shipping costs are non-refundable.]
    end
    
    subgraph Fixed["Fixed-Size 50 chars"]
        f1["Chunk1: Our refund policy allows returns within 30 days. You"]
        f2["Chunk2: must provide a valid receipt. Refunds are proces"]
        f3["Chunk3: sed within 5-7 business days. For international"]
        f1 -.->|"Cuts mid-word"| f2
        f2 -.->|"Loses context"| f3
    end
    
    subgraph Recursive["Recursive by paragraph"]
        r1["Chunk1: Our refund policy allows returns within 30 days. You must provide a valid receipt. Refunds are processed within 5-7 business days."]
        r2["Chunk2: For international orders shipping costs are non-refundable."]
        r1 -->|"Clean boundary"| r2
    end
    
    subgraph Semantic["Semantic by meaning shift"]
        s1["Chunk1: Refund policy + receipt + 5-7 days"]
        s2["Chunk2: International orders exception"]
        s1 -->|"Topic boundary"| s2
    end
    
    Doc --> Fixed
    Doc --> Recursive
    Doc --> Semantic

Fixed-size chunking — Splits by character or token count. Simple. Fast. Problem: ignores semantic boundaries. You might cut mid-sentence: "refunds are proces" + "sed within 5-7 days." A query about "refund processing time" could retrieve a chunk that has neither the full question nor the full answer. Use for prototypes or uniform docs (logs, code). Not for natural language where coherence matters.

Recursive character splitting — Tries to split at natural boundaries first: paragraphs (double newline), then lines, then words, then characters. Preserves meaning. Chunk sizes vary. This is the recommended default. Typical config: 512 tokens, 50-token overlap. The overlap is insurance—if a key sentence spans two chunks, it appears in both.

Semantic chunking — Uses embeddings to find where meaning shifts. Compute similarity between adjacent segments; chunk where it drops. Produces coherent chunks but costs more (embed every sentence during indexing). Research in 2025 shows gains over recursive are inconsistent. Best for homogeneous corpora (technical docs) where topic boundaries are clear.
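The split-where-similarity-drops idea is easy to sketch. Here's a toy version that uses word-overlap (Jaccard) as a cheap stand-in for embedding similarity—a real implementation would embed each sentence and compare cosine scores:

```python
def jaccard(a: str, b: str) -> float:
    """Stand-in for embedding cosine: word-set overlap between two sentences."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def semantic_chunks(sentences: list[str], threshold: float = 0.05) -> list[list[str]]:
    """Start a new chunk wherever similarity to the previous sentence drops."""
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if jaccard(prev, cur) < threshold:
            chunks.append([cur])       # meaning shifted: topic boundary
        else:
            chunks[-1].append(cur)     # same topic: extend current chunk
    return chunks

sents = [
    "Refunds require a valid receipt.",
    "Refunds are processed in 5-7 days.",
    "Our warranty covers defects for one year.",
]
print(semantic_chunks(sents))  # the warranty sentence starts a new chunk
```

The threshold is the knob: too low and everything merges, too high and every sentence becomes its own chunk. With real embeddings, you'd tune it on a sample of your corpus.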

Document-structure-aware chunking — Split by markdown headers, HTML sections, PDF layout. Preserves hierarchy. Attach metadata: "Section 3.2," "Chapter 5." Essential for manuals, legal docs, academic papers. Tables need special handling—Unstructured, OpenDataLoader—or you'll scramble row-column relationships.
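A minimal sketch of structure-aware splitting using only the standard library—LangChain's MarkdownHeaderTextSplitter does this more robustly, but the core idea is just "split at headers, carry the header along as metadata":

```python
import re

def split_by_markdown_headers(md: str) -> list[dict]:
    """Split a markdown doc at headers, attaching the header as chunk metadata."""
    chunks, current = [], {"section": None, "text": []}
    for line in md.splitlines():
        m = re.match(r"^(#{1,3})\s+(.*)", line)
        if m:
            if current["text"]:  # flush the chunk accumulated under the old header
                chunks.append({"section": current["section"],
                               "text": "\n".join(current["text"]).strip()})
            current = {"section": m.group(2), "text": []}
        else:
            current["text"].append(line)
    if current["text"]:
        chunks.append({"section": current["section"],
                       "text": "\n".join(current["text"]).strip()})
    return chunks

doc = """# Refund Policy
Returns accepted within 30 days.

## International Orders
Shipping costs are non-refundable."""

for c in split_by_markdown_headers(doc):
    print(c["section"], "->", c["text"])
```

That `section` field is exactly what powers citations like "Source: HR Handbook, Section 3.2" later in the pipeline.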

Chunk Overlap: It's Not Wasted Space

Aha Moment

Chunk overlap isn't wasted space—it's insurance. Without overlap, a sentence split across two chunks loses its meaning in BOTH chunks.

A 50–200 character overlap is common. If "refunds are processed within 5–7 days" spans the boundary between chunk A and B, overlap ensures it appears in both. Retrieve either chunk, you get the answer. Typical: 10–20% of chunk size.
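You can see the insurance effect with a toy fixed-size splitter in pure Python (not a library call—real splitters work the same way under the hood):

```python
def split_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Toy fixed-size splitter: each chunk starts (chunk_size - overlap) after the last."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)
            if text[i:i + chunk_size]]

text = "Returns need a receipt. Refunds are processed within 5-7 days. Contact support for help."
key = "Refunds are processed within 5-7 days."

no_overlap = split_with_overlap(text, chunk_size=40, overlap=0)
with_overlap = split_with_overlap(text, chunk_size=40, overlap=18)

# Without overlap, the key sentence is cut across a chunk boundary;
# with overlap, one chunk holds it whole.
print(any(key in c for c in no_overlap))    # False
print(any(key in c for c in with_overlap))  # True
```

Same document, same chunk size—the only difference is overlap, and only the second version can answer "how long do refunds take?" from a single retrieved chunk.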

Chunk Size: The Trade-Off Nobody Wants to Admit

Interview Insight

When they ask "how do you choose chunk size?", they want to hear TRADE-OFFS, not a number. The right answer depends on the use case.

Small chunks (200–500 chars, ~50–128 tokens): precise retrieval, each chunk is focused. But the LLM gets fragments—maybe not enough context to answer fully. Large chunks (1000–2000 chars): more context, but noisier—multiple topics, similarity score is an average. Sweet spot: 512 tokens for most apps. Short FAQs: 256. Technical docs: 512–1024. Legal/research: 1024–2048. Experiment. Measure.

Why this matters in production: In document search for customer support, small chunks might retrieve the exact FAQ sentence. In email extraction at Maersk, you might need larger chunks if the system grounds extraction in full booking templates—the template structure matters. In legal document RAG, section-level chunks with metadata (section, subsection) let you filter and cite precisely. For email extraction specifically: if you're matching incoming emails to known booking formats, the "chunk" might be an entire template or a structured schema—not a 512-token snippet. The retrieval goal is different: you're finding the right template to apply, not the right paragraph to quote. That changes the chunking strategy entirely.


Embeddings: GPS Coordinates for Meaning

Embedding models convert text into dense vectors. "King" and "queen" end up close; "king" and "bicycle" end up far apart. The choice of model affects quality, cost, latency, and privacy.

OpenAI text-embedding-3-small — 1536 dimensions, cheap ($0.02/1M tokens), fast, good for most use cases. text-embedding-3-large — 3072 dimensions, better quality, more expensive ($0.13/1M tokens). Both support Matryoshka Representation Learning (MRL): request fewer dimensions (256, 512) via the API—truncation preserves most semantic info. Smaller vectors, less storage, faster search.
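The MRL trick is just truncate-and-renormalize. A numpy sketch on toy vectors—real MRL quality depends on the model being trained to front-load information, which these random stand-ins only approximate:

```python
import numpy as np

def truncate_and_renormalize(vec: np.ndarray, dims: int) -> np.ndarray:
    """MRL-style shortening: keep the first `dims` components, re-unit-normalize."""
    short = vec[:dims]
    return short / np.linalg.norm(short)

rng = np.random.default_rng(0)
# Toy stand-ins for two related 1536-dim embeddings.
a = rng.normal(size=1536); a /= np.linalg.norm(a)
noise = rng.normal(size=1536); noise /= np.linalg.norm(noise)
b = a + 0.3 * noise; b /= np.linalg.norm(b)

full_sim = float(a @ b)
short_sim = float(truncate_and_renormalize(a, 256) @ truncate_and_renormalize(b, 256))
print(f"cosine at 1536 dims: {full_sim:.3f}, at 256 dims: {short_sim:.3f}")
```

With the OpenAI API you don't truncate yourself—you pass the dimensions parameter to embeddings.create and get back shorter vectors directly. Storage drops 6x at 256 dims, and search gets proportionally faster.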

Cohere embed-v3/v4 — input_type parameter: search_document for indexing, search_query for queries. Documents and queries are different—long vs short, descriptive vs interrogative. Using the right type improves relevance. Multilingual (100+ languages), competitive on MTEB.

Open-source (BGE, E5, GTE) — Run locally. No API cost. Air-gapped deployments. Trade-off: you host and maintain. BGE-M3 is strong, multilingual, runs with Sentence-Transformers.

How to choose: MTEB leaderboard on Hugging Face. But validate on your corpus. Run A/B tests. Embedding asymmetry matters—use the correct input type when the model supports it.

Why this matters in production: For email extraction, you might embed booking templates as search_document and incoming email intents as search_query. For customer support RAG, multilingual embeddings (Cohere, BGE-M3) let English queries retrieve Spanish docs. For confidential docs, open-source keeps everything on-prem. One more scenario: if your RAG system serves multiple product lines or departments, you might use the same embedding model but different collections or metadata filters. The embedding space is shared; the retrieval scope is narrowed by metadata. That's a common production pattern—one model, many filtered views.


Embedding and Similarity Search Flow

flowchart LR
    subgraph IndexFlow["Index Time"]
        docChunk[Document Chunk]
        embedModel[Embedding Model]
        vector[1536-dim Vector]
        docChunk --> embedModel --> vector
    end
    
    subgraph Store["Vector Store"]
        vs[(Vectors Indexed)]
        vector --> vs
    end
    
    subgraph QueryFlow["Query Time"]
        userQuery[User Query]
        qEmbed[Same Embedding Model]
        qVector[Query Vector]
        userQuery --> qEmbed --> qVector
    end
    
    subgraph Search["Similarity Search"]
        qVector --> ann[ANN Search]
        vs --> ann
        ann --> topK[Top-K Chunks]
    end

Cosine similarity — Measures angle between vectors. Range -1 to 1. Ignores magnitude. Standard for text—document length shouldn't affect similarity. Most embedding models produce normalized vectors; cosine = dot product in that case.

Dot product — Sum of element-wise products. When normalized, equals cosine. When not, favors longer vectors. Use when magnitude matters (e.g., recommendation systems).

Euclidean distance (L2) — Straight-line distance. Smaller = more similar. Sensitive to magnitude. More common for image embeddings.

Aha Moment

Cosine similarity doesn't measure "understanding." It measures angle between vectors. Two sentences can be semantically different but have similar cosine similarity because they share vocabulary. That's why reranking exists.

ANN (Approximate Nearest Neighbor) — Exact search is O(n)—compare to every vector. Too slow for millions. ANN (HNSW, IVF) builds an index, trades a bit of accuracy for 10–100x speed. Recall typically 95–99%. For RAG, getting the 8th-nearest chunk instead of the 7th rarely changes the answer.
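Exact search is just a full scan. A numpy sketch of what ANN indexes exist to avoid (toy data, no real index):

```python
import numpy as np

def exact_top_k(query: np.ndarray, corpus: np.ndarray, k: int) -> np.ndarray:
    """O(n) exact nearest neighbors: score every stored vector, keep the k best.
    Assumes unit-normalized vectors, so dot product equals cosine similarity."""
    scores = corpus @ query            # one dot product per stored vector
    return np.argsort(-scores)[:k]     # indices of the k highest scores

rng = np.random.default_rng(1)
corpus = rng.normal(size=(10_000, 64))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

query = corpus[42] + 0.05 * rng.normal(size=64)  # a query near a known vector
query /= np.linalg.norm(query)

top = exact_top_k(query, corpus, k=5)
print(top)  # corpus[42] should rank first
```

This is fine at 10K vectors. At 100M, that matrix multiply per query is the bottleneck—which is exactly when HNSW or IVF earns its 95–99% recall trade-off.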


Vector Stores: Picking the Right Library

flowchart TB
    start[Which Vector Store?]
    havePostgres[Already use Postgres?]
    wantZeroOps[Want zero ops?]
    scale[Scale?]
    needHybrid[Need hybrid search?]
    
    start --> havePostgres
    havePostgres -->|Yes, moderate scale| pgvector[pgvector]
    havePostgres -->|No| wantZeroOps
    wantZeroOps -->|Yes| pinecone[Pinecone]
    wantZeroOps -->|No| scale
    scale -->|Millions, performance critical| qdrant[Qdrant]
    scale -->|Billions| milvus[Milvus]
    needHybrid -->|Yes| weaviate[Weaviate]
    
    prototype[Prototyping?] --> chroma[Chroma]

Qdrant — Rust-based, fast (p95 30–40ms for 1M vectors). Rich metadata filtering, quantization. Self-hosted or cloud. Strong docs. Default for production when you want control. ~$30/month cloud.

Pinecone — Fully managed, serverless. Zero ops. Auto-scaling. Free tier 100K vectors. At scale: $70–200/month for 10M vectors. Pick when simplicity > cost.

Chroma — Python-native, lightweight. In-memory or persistent. Prototyping and demos. Not for production scale.

pgvector — Postgres extension. Add vector column, run similarity search in SQL. Unified storage. Good for millions of vectors. Pick when you have Postgres and moderate scale.

Weaviate — Hybrid search (vector + keyword), GraphQL API, multi-modal. ~$25/month. Pick when you need hybrid out of the box.

Milvus — Distributed, billions of vectors. Complex to operate. Pick when scale is the constraint and you have the team.

Decision framework: pgvector if you have Postgres. Qdrant for performance. Pinecone for zero-ops. Chroma for prototyping. Weaviate for hybrid. Milvus for billion-scale.

Why this matters in production: At Maersk, if you already ran Postgres for booking data, pgvector could store document embeddings alongside—unified ops. If you needed sub-50ms retrieval for a high-traffic support chatbot, Qdrant. If you were a small team shipping fast, Pinecone. Another consideration: multi-tenancy. If each customer has their own document set, you might use a single vector store with tenant_id in metadata and pre-filter every query—or separate collections per tenant. The choice affects isolation, cost, and query complexity. Interviewers love when you bring up multi-tenancy; it shows you've thought beyond the happy path.


Vector Store Query Flow — The Full Retrieval Path

flowchart TB
    query[User Query]
    embed[Embed Query]
    ann[ANN Search]
    filter[Metadata Filter]
    rerank[Rerank]
    return[Return Top-K]
    
    query --> embed
    embed --> ann
    ann --> filter
    filter --> rerank
    rerank --> return

In practice: embed query → run ANN search (optionally with pre-filter) → apply metadata filter (pre or post) → optionally rerank with cross-encoder → return top-k chunks. Each step is a lever. Pre-filter narrows the corpus before search. Reranking rescores top-k for better precision. Metadata enables "only HR, only 2024."


Metadata Filtering: The Librarian's Secret Weapon

Metadata enriches chunks: source, page, section, date, category, author. Add at index time. Filter at query time.

Pre-filtering — Apply metadata constraints first, then vector search on the subset. "WHERE category = 'refund_policy' AND date > '2024-01-01'" → then similarity search. You never retrieve wrong-department docs.

Post-filtering — Run similarity search first, then filter results. Simpler but can return fewer than k if many top chunks fail the filter. Use when metadata is secondary to relevance.
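The difference shows up clearly in a toy example (plain Python; the scores are made up stand-ins for similarity search output):

```python
# Toy results: (id, score_vs_query, metadata). Scores would come from vector search.
chunks = [
    ("a", 0.95, {"dept": "product"}),
    ("b", 0.90, {"dept": "product"}),
    ("c", 0.85, {"dept": "hr"}),
    ("d", 0.80, {"dept": "product"}),
    ("e", 0.75, {"dept": "hr"}),
]
k = 2

# Pre-filter: restrict the corpus first, THEN take top-k by score.
pre = sorted((c for c in chunks if c[2]["dept"] == "hr"),
             key=lambda c: -c[1])[:k]

# Post-filter: take top-k by score first, THEN drop non-matching rows.
post = [c for c in sorted(chunks, key=lambda c: -c[1])[:k]
        if c[2]["dept"] == "hr"]

print([c[0] for c in pre])   # ['c', 'e'] - always k results if enough exist
print([c[0] for c in post])  # [] - top-2 were both 'product'; the filter emptied the list
```

That empty post-filter result is the failure mode to mention in interviews: post-filtering can silently return fewer than k chunks, or none at all.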

Why this matters in production: Enterprises have policies, product docs, support articles, wikis. Without metadata, "refund policy" might hit product manuals. Metadata enables multi-tenant RAG (filter by tenant), temporal filtering (current policies only), access control (user permissions). Better citations: "Source: HR Handbook, Section 3.2, Jan 2024."


Document Loaders: Where Garbage In Starts

RAG quality starts at extraction. Bad extraction → bad chunks → bad retrieval → wrong answers.

PDF — PyPDF: fast, fails on scanned PDFs, scrambles tables. Unstructured: layout-aware, OCR, tables. OpenDataLoader (2025): markdown output, ~93% table accuracy. Use a table-aware parser when your docs contain tables.

HTML — BeautifulSoup. Preserve structure. Strip scripts/styles. Respect robots.txt.

Emails — MIME parsing. Body, attachments, headers. Attachments need separate loaders.

Word, Excel, Markdown — python-docx, pandas, split by headers. Each format has quirks.

Why this matters in production: For Maersk email booking, you'd need loaders for PDF rate sheets, Word templates, and email bodies. Poor extraction of carrier names or port codes would poison the whole pipeline. Invest in extraction quality for your dominant formats. A common mistake: assuming "good enough" extraction is fine. It's not. If your PDF loader scrambles tables, your chunks will mix row data across columns—and retrieval will return nonsense. Fix extraction first. Everything downstream depends on it.


Two Truth Bombs You Need to Internalize

Aha Moment

RAG doesn't solve hallucinations—it reduces them. The model can STILL ignore the context you provide and hallucinate. That's why you need grounding checks and output validation even with RAG.

Aha Moment

The quality of your RAG system is determined by retrieval, not generation. If you retrieve the wrong chunks, even GPT-4 will give you a wrong answer. Garbage in, garbage out—but with confident tone.


The "7 Chunks Irrelevant" Question — The Senior vs Junior Test

Interview Insight

The real test: "Your RAG system retrieves 10 chunks but 7 are irrelevant. What do you do?" This separates senior from junior. Junior: "use a better embedding model." Senior: "I would diagnose whether it's a chunking problem, an embedding problem, a query problem, or a data quality problem, then fix the root cause."

Diagnosis first. Inspect the chunks. Right document, wrong section? Chunking too large. Wrong document entirely? Embedding or query issue.

Fixes, in order of impact:

  1. Chunking — Try smaller chunks (256, 512), more overlap. Chunks mixing topics? Split by structure.
  2. Query expansion — Queries are short; docs are long. Use LLM to expand query, or HyDE (hypothetical document embeddings).
  3. Reranking — Cross-encoder or Cohere Rerank on top-10 → top-3. +100–200ms latency, big precision gain.
  4. Hybrid search — BM25 + vector, merge with RRF. Catches exact matches (IDs, codes) that vectors miss.
  5. Metadata filtering — Restrict to relevant category/section before search.
  6. Embedding model — Try different model. Validate on your corpus—MTEB is a guide.
  7. Reduce k — If 10 chunks and only 3 are useful, retrieve 5. Less noise.
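The RRF merge in fix 4 is simple enough to write by hand. A sketch using the conventional k=60 constant from the original RRF formulation:

```python
def rrf_fuse(*rankings: list[str], k: int = 60, top_k: int = 5) -> list[str]:
    """Reciprocal Rank Fusion: each doc scores sum(1 / (k + rank)) across
    ranked lists. k=60 damps the influence of any single list's top ranks."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

bm25_hits = ["doc3", "doc1", "doc7"]    # keyword ranking (exact ID match won)
vector_hits = ["doc1", "doc4", "doc3"]  # semantic ranking

fused = rrf_fuse(bm25_hits, vector_hits, top_k=3)
print(fused)  # doc1 and doc3 appear in both lists, so they rise to the top
```

Notice RRF only needs ranks, not scores—which is why it merges BM25 and vector results cleanly even though their raw scores live on incompatible scales.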

Code Examples

Full RAG Pipeline with LangChain and Qdrant

"""
Complete RAG pipeline: load documents → chunk → embed → store in Qdrant → query → generate.
Requires: pip install langchain langchain-community langchain-openai langchain-qdrant qdrant-client
"""
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
 
# 1. Load documents
loader = TextLoader("company_docs.txt")
documents = loader.load()
 
# 2. Chunk with recursive splitting
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    length_function=len,
    separators=["\n\n", "\n", " ", ""],
)
chunks = splitter.split_documents(documents)
 
# 3. Embed and store in Qdrant
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
client = QdrantClient(":memory:")  # Use URL for production
client.create_collection(
    collection_name="rag_docs",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)
 
vector_store = QdrantVectorStore(
    client=client,
    collection_name="rag_docs",
    embedding=embeddings,
)
vector_store.add_documents(chunks)
 
# 4. Create retriever and RAG chain
retriever = vector_store.as_retriever(search_kwargs={"k": 5})
 
template = """Answer the question based only on the following context.
If the context does not contain the answer, say "I don't have that information."
 
Context:
{context}
 
Question: {question}
Answer:"""
 
prompt = ChatPromptTemplate.from_template(template)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
 
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
 
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
 
# 5. Query
answer = rag_chain.invoke("What is the refund policy?")
print(answer)

Chunking Comparison: Fixed-Size, Recursive, and Semantic

"""
Compare chunking strategies on the same document.
Shows how different strategies produce different chunk boundaries.
"""
from langchain_text_splitters import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
)
 
sample_text = """
Our refund policy allows returns within 30 days of purchase.
You must provide a valid receipt. Refunds are processed within 5-7 business days.
 
For international orders, shipping costs are non-refundable.
Please contact support@company.com for assistance.
 
Our warranty covers manufacturing defects for one year.
This does not include normal wear and tear.
"""
 
# Fixed-size: 100 chars, no overlap
fixed_splitter = CharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=0,
    separator="",
)
fixed_chunks = fixed_splitter.split_text(sample_text)
print("=== Fixed-size (100 chars) ===")
for i, c in enumerate(fixed_chunks):
    print(f"Chunk {i+1}: {repr(c[:80])}...")
 
# Recursive: 150 chars, 20 char overlap
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=20,
    separators=["\n\n", "\n", " ", ""],
)
recursive_chunks = recursive_splitter.split_text(sample_text)
print("\n=== Recursive (150 chars, 20 overlap) ===")
for i, c in enumerate(recursive_chunks):
    print(f"Chunk {i+1}: {repr(c[:80])}...")
 
# Semantic chunking would require embeddings - here we simulate by splitting on double newline
semantic_chunks = [p.strip() for p in sample_text.split("\n\n") if p.strip()]
print("\n=== Semantic (by paragraph) ===")
for i, c in enumerate(semantic_chunks):
    print(f"Chunk {i+1}: {repr(c[:80])}...")

Embedding Generation and Cosine Similarity from Scratch

"""
Generate embeddings and compute cosine similarity.
No LangChain - direct API and numpy.
"""
from openai import OpenAI
import numpy as np
 
client = OpenAI()
 
def get_embedding(text: str, model: str = "text-embedding-3-small") -> list[float]:
    response = client.embeddings.create(input=[text], model=model)
    return response.data[0].embedding
 
def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: (A · B) / (||A|| × ||B||). Range [-1, 1]."""
    a_np = np.array(a, dtype=np.float32)
    b_np = np.array(b, dtype=np.float32)
    dot = np.dot(a_np, b_np)
    norm_a = np.linalg.norm(a_np)
    norm_b = np.linalg.norm(b_np)
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return float(dot / (norm_a * norm_b))
 
def dot_product_similarity(a: list[float], b: list[float]) -> float:
    """Dot product - same as cosine when vectors are normalized."""
    return float(np.dot(np.array(a), np.array(b)))
 
def euclidean_distance(a: list[float], b: list[float]) -> float:
    """L2 distance - smaller is more similar."""
    return float(np.linalg.norm(np.array(a) - np.array(b)))
 
# Example: compare query to documents
query = "What is your return policy?"
docs = [
    "We allow returns within 30 days with a receipt.",
    "Our office hours are 9am to 5pm EST.",
    "Refunds are processed within 5-7 business days.",
]
 
query_emb = get_embedding(query)
doc_embs = [get_embedding(d) for d in docs]
 
print("Query:", query)
print("\nCosine similarities (higher = more similar):")
for doc, emb in zip(docs, doc_embs):
    sim = cosine_similarity(query_emb, emb)
    print(f"  {sim:.4f} | {doc[:50]}...")

Metadata Filtering with Qdrant

"""
Metadata filtering: add metadata at index time, filter at query time.
"""
from qdrant_client import QdrantClient
from qdrant_client.http import models
from qdrant_client.http.models import Distance, VectorParams, Filter, FieldCondition, MatchValue
 
# Create client and collection with metadata
client = QdrantClient(":memory:")
client.create_collection(
    collection_name="docs_with_metadata",
    vectors_config=VectorParams(size=4, distance=Distance.COSINE),
)
 
# Upsert points with metadata (in practice, vectors come from embedding model)
client.upsert(
    collection_name="docs_with_metadata",
    points=[
        models.PointStruct(
            id=1,
            vector=[0.1, 0.2, 0.3, 0.4],
            payload={"source": "hr_handbook", "section": "leave", "year": 2024},
        ),
        models.PointStruct(
            id=2,
            vector=[0.2, 0.3, 0.4, 0.5],
            payload={"source": "hr_handbook", "section": "refund", "year": 2024},
        ),
        models.PointStruct(
            id=3,
            vector=[0.15, 0.25, 0.35, 0.45],
            payload={"source": "product_manual", "section": "warranty", "year": 2023},
        ),
    ],
)
 
# Query with metadata filter: only hr_handbook, year 2024
results = client.search(
    collection_name="docs_with_metadata",
    query_vector=[0.18, 0.28, 0.38, 0.48],
    query_filter=Filter(
        must=[
            FieldCondition(key="source", match=MatchValue(value="hr_handbook")),
            FieldCondition(key="year", match=MatchValue(value=2024)),
        ]
    ),
    limit=3,
)
 
print("Filtered results (hr_handbook, 2024 only):")
for r in results:
    print(f"  id={r.id}, score={r.score:.4f}, payload={r.payload}")

Conversational Interview Q&A: Weak vs Strong Answers

Q1: "Design a RAG system for internal company documentation. Walk through every design choice."

Weak answer: "I'd use LangChain, chunk the docs, embed with OpenAI, put them in Pinecone, and use GPT to generate answers."

Strong answer: "I'd start by clarifying requirements: who are the users, what document types, query volume, access control, latency budget. For loading, I'd use Unstructured for PDFs—handles tables and layout—and format-specific loaders for wikis and Confluence. For chunking, recursive splitting at 512 tokens with 50-token overlap as a baseline, and I'd explore structure-aware chunking by markdown headers. I'd A/B test 256 vs 512 vs 1024 and measure retrieval precision on a golden set. For embeddings, I'd choose based on privacy: OpenAI or Cohere if data can leave the network, BGE-M3 if on-prem. For the vector store, pgvector if we have Postgres, Qdrant for performance, Pinecone for zero-ops. I'd add metadata—source, section, last_updated—for filtering. Retrieve top-10, optionally rerank to top-5. Prompt: answer only from context, cite sources, say 'I don't know' when insufficient. Temperature 0. For frequent updates, incremental indexing—re-chunk and re-embed only changed docs, upsert by stable ID."


Q2: "How do you choose chunk size? What are the trade-offs?"

Weak answer: "512 tokens is standard."

Strong answer: "It's a trade-off between retrieval precision and context sufficiency. Small chunks—256 tokens—give precise retrieval; each chunk is focused. But the LLM gets fragments and might lack context. Large chunks—1024 tokens—give more context but mix topics; similarity scores average over multiple ideas. The sweet spot depends on the corpus. For short FAQs, 256 works. For technical docs, 512–1024. I'd run experiments: chunk at 256, 512, 1024; build a golden set of 20–50 Q&A pairs; measure retrieval precision and end-to-end answer quality. Choose the size that maximizes the target metric. There's no universal optimum."


Q3: "Compare Qdrant, Pinecone, and pgvector. When would you pick each?"

Weak answer: "Pinecone is easier, Qdrant is faster, pgvector is for Postgres."

Strong answer: "pgvector: Postgres extension, add a vector column, SQL interface. Advantage—you already have Postgres, unified storage, transactional consistency. Good for millions of vectors. Pick when we have Postgres and moderate scale, want to minimize ops. Qdrant: dedicated vector DB, Rust, 30–40ms p95 for 1M vectors. Rich filtering, quantization. Pick when we need production performance and control. My default for performance-critical RAG. Pinecone: fully managed, serverless. Zero ops. Trade-off: cost at scale. Pick when operational simplicity is top priority, small team, or willing to pay for managed. Summary: pgvector for simplicity and existing Postgres; Qdrant for performance; Pinecone for zero-ops."


Q4: "Your RAG retrieves 10 chunks but 7 are irrelevant. What do you do?"

Weak answer: "Use a better embedding model."

Strong answer: "I'd diagnose first. Inspect the chunks—right document, wrong section? Chunking too large. Wrong document entirely? Embedding or query issue. Then: (1) Try smaller chunks and more overlap. (2) Query expansion or HyDE—queries are short, docs are long. (3) Add reranking—cross-encoder on top-10 to top-3. (4) Hybrid search—BM25 + vector, RRF merge—for exact matches. (5) Metadata filtering to restrict category. (6) Try a different embedding model, validate on our corpus. (7) Reduce k—if only 3 of 10 are useful, retrieve 5. Fix the root cause, not the symptom."

# sketch — helpers (generate_hyde_document, embed_model, retrieve_with_filters,
# search_bm25, rrf_fuse, rerank_with_cross_encoder) are assumed to be defined
# elsewhere in your pipeline
def advanced_retrieve(query: str):
    # 1) semantic (HyDE) + metadata filter
    hyde_doc = generate_hyde_document(query)
    emb = embed_model.encode(hyde_doc, normalize_embeddings=True)
    vec_results = retrieve_with_filters(emb, k=15, filters={"env": "prod"})

    # 2) BM25 keyword search (same filter baked into your index selection or query)
    bm25_results = search_bm25(query, k=15)

    # 3) hybrid fusion via reciprocal rank fusion
    fused = rrf_fuse(bm25_results, vec_results, top_k=8)

    # 4) cross-encoder rerank down to the final context set
    docs = [text for (_, _, text) in fused]  # fused items are (id, score, text)
    reranked = rerank_with_cross_encoder(query, docs, top_k=3)

    return reranked  # these 3 chunks go into your RAG context window
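The rrf_fuse step is the one piece above that fits in a few lines. A minimal implementation of reciprocal rank fusion, assuming each input list is ordered best-first as (doc_id, text) pairs (the names and tuple shapes are illustrative):

```python
def rrf_fuse(bm25_results, vec_results, top_k=8, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over result lists of 1 / (k + rank).
    Inputs are ordered lists of (doc_id, text) pairs, best first.
    Returns (doc_id, score, text) triples, best first."""
    scores, texts = {}, {}
    for results in (bm25_results, vec_results):
        for rank, (doc_id, text) in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
            texts[doc_id] = text
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(doc_id, score, texts[doc_id]) for doc_id, score in ranked[:top_k]]
```

The k=60 constant is the conventional default from the original RRF paper; documents appearing in both lists get two score contributions, which is why hybrid search rewards agreement between BM25 and vector retrieval.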

Q5: "How do you handle frequently updated or deleted documents?"

Weak answer: "Re-index everything when something changes."

Strong answer: "For updates: incremental indexing. Track document versions by hash or timestamp. When a doc changes, re-chunk and re-embed only that doc. Upsert into the vector store with stable IDs—document_id + chunk_index. Some stores support delete-by-filter: delete all chunks where source=doc_id, then insert new set. Run on a schedule or on change events. For deletes: delete by metadata filter if supported. Otherwise track deleted IDs and post-filter—less ideal. For soft deletes, add deleted: false metadata and filter at query time. At scale, use content-addressed storage—skip unchanged docs by hash."

import time

# sketch — Document, index_document, soft_delete_document, and
# query_with_filters are assumed helpers from your indexing pipeline

# create or update a doc
doc = Document(
    id="kb_1234",
    text="How to rotate logs in Kubernetes...\nStep 1 ...",
    updated_at=time.time(),
)
index_document(doc)

# later the doc changes
doc.text = "UPDATED: How to rotate logs in Kubernetes...\nStep 1 ..."
doc.updated_at = time.time()
index_document(doc)  # only this doc is re-chunked + upserted

# soft delete: mark deleted=True instead of removing vectors
soft_delete_document("kb_1234")

# query (won't see kb_1234 because deleted=True)
results = query_with_filters("rotate logs kubernetes", base_filter={"product": "k8s"})


Q6: "What are the limitations of RAG? When does it fail?"

Weak answer: "It can hallucinate and retrieval isn't perfect."

Strong answer: "Several failure modes. (1) Retrieval quality caps answer quality—garbage in, garbage out. (2) Query sensitivity—'refund policy' vs 'return policy' can change results. Mitigate with query expansion, multi-query. (3) Lost in the middle—too many chunks dilute attention. Retrieve fewer, higher-quality chunks; place best at start or end. (4) Scale and latency—retrieval adds 50–200ms. (5) Multimodal—tables, images need specialized extraction. (6) Stale or conflicting docs—need lifecycle management. RAG fails when retrieval consistently returns irrelevant results and we can't improve it, when multi-hop reasoning is needed across many docs, when latency is too strict, or when knowledge is too dynamic to keep fresh. Then we might need fine-tuning, knowledge graphs, or hybrid approaches."


Q7: "How would you build RAG for multi-language documents?"

Weak answer: "Use a multilingual embedding model."

Strong answer: "Several layers. Embedding: choose multilingual—OpenAI large, Cohere, BGE-M3. Validate cross-lingual retrieval on our language mix. Chunking: token-based more robust than character-based for languages without spaces. Query handling: (1) embed query as-is if model aligns languages well; (2) translate to canonical language if needed; (3) multi-query in multiple languages, merge with RRF for recall. Generation: instruct 'answer in same language as question.' Metadata: add language field at index time for filtering and evaluation. Build golden set per language; monitor for language preference bias."


Q8: "Tell me about your Maersk email booking RAG experience. What was the knowledge base? How did you structure retrieval?"

Weak answer: "We used RAG for the booking system. It worked well."

Strong answer: "The email booking system needed to extract structured data from unstructured emails—carrier, port, dates, rates. The knowledge base could include: booking templates, carrier policies, port codes, rate tables. RAG could ground extraction in known formats—when the system sees 'MSC' or 'Maersk,' it retrieves the relevant carrier policy to validate and structure the extraction. Chunking would depend on document type: rate tables might be row-level; policies might be section-level with metadata. We'd use metadata filtering—carrier, port, date range—to narrow retrieval. The vector store choice would depend on existing infra: if we had Postgres for booking data, pgvector could unify; if we needed low latency for high-volume email processing, Qdrant. The key is that RAG wasn't just 'answer questions'—it was grounding extraction in a known schema and policies. That's a different retrieval design than a generic Q&A chatbot."


Quick Fire Round — 15 Rapid Q&A

1. What is RAG in one sentence?
Retrieve relevant context at query time, inject into the prompt, let the LLM answer from it. Open-book exam for the model.

2. What are the three RAG pipelines?
Indexing (offline: load, chunk, embed, store), Retrieval (online: embed query, search, rerank), Generation (online: context + query → LLM → answer).

3. Why use recursive over fixed-size chunking?
Recursive preserves semantic boundaries (paragraphs, sentences). Fixed-size cuts mid-sentence and loses context.

4. What is chunk overlap and why use it?
Overlap = shared tokens between adjacent chunks. Prevents losing context at boundaries—a sentence spanning two chunks appears in both.

5. Small vs large chunks—trade-off?
Small: precise retrieval, less context for LLM. Large: more context, noisier (multiple topics). Sweet spot ~512 tokens, depends on use case.

6. What is cosine similarity?
Angle between vectors. Range -1 to 1. Ignores magnitude. Standard for text embeddings.

7. Why ANN instead of exact search?
Exact is O(n), too slow for millions of vectors. ANN (HNSW, IVF) trades ~5% accuracy for 10–100x speed. Recall ~95–99%.

8. Pre-filter vs post-filter?
Pre-filter: apply metadata constraints first, then vector search on subset. Post-filter: search first, filter results. Pre preferred when constraints are strict.

9. When pick pgvector vs Qdrant vs Pinecone?
pgvector: have Postgres, moderate scale. Qdrant: need performance, control. Pinecone: want zero-ops, willing to pay.

10. What is reranking and when use it?
Cross-encoder or Cohere Rerank rescores top-k from initial retrieval. Improves precision. Adds ~100–200ms. Use when retrieval quality matters more than latency.

11. What is HyDE?
Hypothetical Document Embeddings. Generate a hypothetical answer to the query, embed that, use for retrieval. Aligns query space with document space.

12. Does RAG eliminate hallucinations?
No. It reduces them. The model can still ignore context and hallucinate. Need grounding checks and output validation.

13. What determines RAG quality—retrieval or generation?
Retrieval. Wrong chunks → wrong answer, even with GPT-4. Garbage in, garbage out.

14. What is MRL (Matryoshka)?
Matryoshka Representation Learning: embeddings trained so a truncated prefix keeps most of the semantic information. OpenAI's text-embedding-3 models expose it via the dimensions parameter (e.g., 256). Smaller vectors, less storage, faster search.

15. Cohere search_document vs search_query?
Different input types for indexing vs querying. Documents are long/descriptive; queries are short/interrogative. Using the right type improves relevance.
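Two of the quick-fire items (cosine similarity, Matryoshka truncation) are worth seeing as code. A stdlib-only sketch; real systems would use numpy, and the renormalization step is what keeps cosine math honest after truncation:

```python
import math

def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|). Range -1 to 1; ignores magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def truncate(vec, dims):
    """Matryoshka-style shortening: keep the first `dims` components,
    then re-normalize to unit length so similarity scores stay comparable."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]
```

Identical vectors score 1.0, orthogonal vectors 0.0; a truncated-and-renormalized vector has unit length, so it drops straight into the same cosine comparisons as the full-dimension version.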


From Your Experience

Prepare your own stories using the STAR format (Situation, Task, Action, Result). Use these prompts to reflect on your work at Maersk building an AI email booking system and AI platform:

How did you implement RAG in the email booking system? What was the knowledge base?
Consider: Did the email booking agent use RAG? If so, what was the knowledge base—booking templates, carrier policies, port information, rate tables? How did you structure the retrieval—was it for answering user questions about booking, or for grounding extraction in known formats? If you did not use RAG, what alternative did you use for knowledge grounding (e.g., few-shot examples, structured schemas)?

What chunking strategy did you use? Why?
Consider: If you had a document corpus (policies, templates, FAQs), how did you chunk it? Fixed-size for speed? Recursive for semantic boundaries? What chunk size and overlap? What drove the choice—document type, retrieval quality, or simplicity?

What vector store did you choose and why?
Consider: Did you use Qdrant, Pinecone, pgvector, Chroma, or something else? What were the decision factors—existing infrastructure, scale, operational preference, cost? Did you evaluate multiple options?


Key Takeaways (Cheat Sheet)

RAG: Retrieve relevant context at query time, inject into prompt. Lewis et al. 2020. Solves knowledge cutoff and hallucination. Most common enterprise pattern.
Indexing: Load → extract → chunk → embed → store. Offline/batch.
Retrieval: Query → embed → similarity search → optional rerank → top-k chunks.
Generation: Context + query → prompt → LLM → answer. Ground in context.
Fixed chunking: Simple, fast. Splits mid-sentence. Use for prototypes or uniform docs.
Recursive chunking: Split by paragraph → sentence → word. Preserves boundaries. Default choice.
Semantic chunking: Embedding-based break points. Better quality, higher cost. Inconsistent gains.
Chunk size: 512 tokens typical. Small = precise retrieval, less context. Large = more context, noisier.
Overlap: 10–20% (50–100 tokens). Prevents boundary loss.
OpenAI embeddings: text-embedding-3-small (1536d, cheap), text-embedding-3-large (3072d, better). Matryoshka = truncate dims.
Cohere: input_type of search_document vs search_query. Multilingual.
Open-source: BGE, E5, GTE. Local, no API cost, privacy.
Qdrant: Rust, fast, rich filtering. Production default.
Pinecone: Managed, zero-ops. Expensive at scale.
pgvector: Postgres extension. Use existing infra. Good for moderate scale.
Chroma: Lightweight, Python. Prototyping only.
Cosine similarity: Angle between vectors. Ignores magnitude. Default for text.
ANN: Approximate nearest neighbor. HNSW, IVF. Trade accuracy for speed.
Metadata: Add at index. Pre-filter (narrow first) vs post-filter (filter results).
RAG limits: Retrieval quality caps answer quality. Query sensitivity. Lost in the middle.

Further Reading (Optional)