Session 13: Vector Databases & Embeddings Deep Dive
Vector databases are the unsung heroes of every RAG system. You probably spent weeks tuning your prompts—but if your retrieval is garbage, your answers will be too. The LLM can only regurgitate what you feed it. Bad embeddings, sloppy indexing, or naive keyword-only search will leave users frustrated no matter how clever your chain-of-thought prompting is. Nobody tweets about HNSW parameters. Nobody brags about their efConstruction tuning at parties. But screw this up and your whole RAG stack collapses. Garbage in, garbage out—and retrieval is the garbage gate.
We've all seen it: a RAG demo that crumbles the moment someone asks about a specific booking ID or a document that uses slightly different wording. The model "hallucinates" because it never found the right chunk. Blame the LLM if you want—but often the real culprit is upstream. Your vector store didn't return it. Your embeddings didn't capture it. Your indexing strategy was wrong for the query distribution. Fix retrieval first. Then worry about prompt engineering.
Here's the thing: at Maersk, when we built the AI email booking automation, retrieval wasn't an afterthought—it was the foundation. You can't prompt-engineer your way out of bad chunks. Customers ask "What's the status of my booking?" or "Show me containers leaving next week." If we can't find the right documents, the agent might as well guess. Same story for the Enterprise AI Agent Platform: every tool-augmented workflow that touches documents depends on fast, accurate vector search. This session is your deep dive into embeddings, indexing, and the databases that make it all work—written for engineers who ship to production, not academics who cite papers.
1. Embedding Models
Interview Insight: Interviewers will probe whether you understand which embedding model fits which use case—cost vs quality, API vs self-hosted, and when to truncate dimensions. Know at least one API option (OpenAI or Cohere) and one open-source option (BGE or E5) cold.
The model landscape is crowded. Here's the cheat sheet. OpenAI text-embedding-3-small (1536 dims, ~$0.02/1M tokens) offers a strong cost/quality balance for most production RAG. We use it everywhere at Maersk—the price is right and the quality holds up. Use it by default—you can always upgrade later. text-embedding-3-large (3072 dims, ~$0.13/1M tokens) is six times more expensive but higher quality when retrieval is paramount. We tried both at Maersk; for most booking and policy docs, small was plenty. Both support Matryoshka Representation Learning (MRL): you can truncate to 256, 512, or 1024 dimensions with graceful quality loss. Earlier dimensions hold more semantic signal; truncating from the end preserves most of it. A 256-dim truncation of text-embedding-3-large can outperform full 1536-dim ada-002 on MTEB. Smaller vectors = less storage, faster indexing, faster search.
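The Matryoshka truncation described above is easy to do client-side when an API doesn't expose a dimensions parameter: keep the first k values, then re-normalize so cosine math still behaves. A minimal sketch with a toy vector (real embeddings would come from a model):

```python
import math

def truncate_embedding(vec: list[float], dims: int) -> list[float]:
    """Client-side Matryoshka truncation: keep the first `dims` values,
    then re-normalize to unit length so cosine similarity stays meaningful."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Toy 8-dim "embedding" truncated to 4 dims
full = [0.5, 0.4, 0.3, 0.2, 0.1, 0.05, 0.02, 0.01]
short = truncate_embedding(full, 4)
print(len(short))                           # 4
print(round(sum(x * x for x in short), 6))  # 1.0 (unit length again)
```

The re-normalization step matters: after chopping dimensions the vector is no longer unit length, and un-normalized vectors skew cosine scores.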
Cohere embed-v3 distinguishes itself with input_type: use "search_document" when indexing and "search_query" when querying. That asymmetry matters—the query encodes "what I'm looking for," the document encodes "what this is about." Cohere is multilingual (100+ languages) and supports int8/binary compression. Good for teams that want explicit asymmetry and multilingual corpora.
Open-source models (BGE, E5, GTE, Nomic) give you zero per-token cost and full control. BGE-large-en-v1.5 is strong for English. E5 is instruction-tuned—prepend "query: " or "passage: " for better zero-shot. GTE-Qwen2-7B tops MTEB but needs serious compute. Nomic Embed handles up to 8K tokens for long docs. Run them with sentence-transformers or HuggingFace.
Analogy: Picking an embedding model is like choosing tires for a car. All-season (OpenAI small) works for most roads. Track tires (large or GTE) grip better but cost more and wear faster. Snow tires (Cohere for multilingual, Nomic for long context) fit specific conditions. Don't over-spec—match the terrain.
Why This Matters in Production: At Maersk, we batch-embed thousands of booking emails and policy docs. Wrong model choice means higher API bills or worse recall. We run MTEB retrieval scores and a small domain eval before committing.
Aha Moment: MTEB overall score can mislead. A model great at classification might suck at retrieval. Always check retrieval-specific metrics—that's what RAG cares about.
2. How Embeddings Work
Interview Insight: You don't need to derive contrastive loss, but you should explain that similar texts map to nearby vectors and distance (cosine, L2) reflects semantic similarity. Mention tokenization, mean-pooling, and that different models produce incompatible spaces.
Here's the mental model that will save you in interviews. Embeddings are dense vectors: each text becomes a point in high-dimensional space. "The cat sat on the mat" and "A feline rested on the rug" have high cosine similarity; "Quantum mechanics explains particle behavior" does not. Models learn this via contrastive learning—pulling similar pairs close and pushing dissimilar pairs apart. Triplet loss, InfoNCE, etc. encode meaning into the vector structure.
Pipeline: text → tokenize → model forward pass → vector. Usually mean-pooling over the last hidden state. Output: fixed-dim float32 vector. Fine-tuning on (query, relevant_doc) pairs improves domain-specific retrieval; FlagEmbedding supports BGE/E5/GTE.
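The pooling and distance steps of that pipeline fit in a few lines. A sketch with hand-made toy token vectors standing in for a model's hidden states:

```python
def mean_pool(token_vectors: list[list[float]]) -> list[float]:
    """Average token-level hidden states into one sentence vector."""
    n = len(token_vectors)
    return [sum(tok[i] for tok in token_vectors) / n for i in range(len(token_vectors[0]))]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product over the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

# Toy token vectors for three "sentences" (a real model outputs these)
cat = mean_pool([[1.0, 0.1], [0.9, 0.2]])
feline = mean_pool([[0.8, 0.2], [1.0, 0.1]])
physics = mean_pool([[0.1, 1.0], [0.0, 0.9]])
print(cosine(cat, feline))   # high: near-synonymous texts land close together
print(cosine(cat, physics))  # low: unrelated texts land far apart
```

Real models do this over hundreds of dimensions, but the geometry is the same: similar meaning, small angle.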
Analogy: Embeddings are like GPS coordinates for meaning. "Paris" and "France" are nearby; "Paris" and "sodium chloride" are far. The map (embedding space) is learned—different models use different maps, so you can't mix vectors from different models.
Why This Matters in Production: When we upgraded embeddings at Maersk, we couldn't mix old and new vectors. Full re-embed and index rebuild. Know this before you "just swap the model."
Aha Moment: Vectors from model A and model B are incomparable—their distances are meaningless. Always re-embed everything when changing models.
3. Vector Indexing Algorithms
Interview Insight: HNSW is the one to know inside out. Be ready to sketch the hierarchical graph, explain M/efConstruction/efSearch, and contrast with flat, IVF, and PQ. Interviewers love "why HNSW?" and "when would you use something else?"
Let's start with the boring baseline. Flat / Brute-Force: Compare the query to every vector. O(n), 100% recall. No fancy index—just scan. Use for <100K vectors when latency is acceptable. This is your baseline for evaluating approximate methods; if flat search gives you 95% recall and HNSW gives you 92%, you know what you're sacrificing.
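The flat baseline is simple enough to write by hand, which is exactly why it makes a trustworthy recall yardstick. A sketch over a toy 2-dim corpus:

```python
def flat_search(query: list[float], corpus: list[list[float]], k: int = 3) -> list[int]:
    """Exact nearest-neighbor baseline: score every vector, return top-k indices.
    O(n) per query, 100% recall -- the yardstick for evaluating any ANN index."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / ((sum(x * x for x in a) ** 0.5) * (sum(x * x for x in b) ** 0.5))
    ranked = sorted(range(len(corpus)), key=lambda i: cos(query, corpus[i]), reverse=True)
    return ranked[:k]

corpus = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
print(flat_search([1.0, 0.05], corpus, k=2))  # -> [0, 1], the two closest vectors
```

Run your eval queries through this first; the gap between flat and your ANN index is exactly the recall you're trading for speed.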
HNSW (Hierarchical Navigable Small World) dominates production. Multi-layer graph: bottom layer has all vectors, each connected to M nearest neighbors (M typically 16–64). Upper layers are "express lanes" with fewer nodes. Search starts at the top, greedily finds nearest neighbor, descends, refines. O(log n) under good conditions. Supports incremental insert, cosine/L2, no full rebuilds. Key params: M (density), efConstruction (index quality, 100–500), efSearch (query accuracy vs latency). Typical: M=16, efConstruction=200, efSearch=100.
IVF (Inverted File Index): k-means clusters; at query time, search only nprobe nearest clusters. Faster build, lower memory, but cluster-boundary issues. PQ (Product Quantization): Compress vectors (e.g., 1536 floats → 48 bytes). Lossy; combine with IVF for billion-scale. ScaNN: Google's anisotropic quantization—better accuracy than PQ when you need it.
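The "1536 floats → 48 bytes" PQ figure is worth sanity-checking with back-of-envelope arithmetic, because it's what makes billion-scale feasible:

```python
# Memory arithmetic behind the PQ compression figure quoted above
dims, bytes_per_float = 1536, 4
raw = dims * bytes_per_float  # 6144 bytes per float32 vector
pq = 48                       # e.g. 48 subquantizers x 1-byte code each
print(raw, pq, raw // pq)     # 6144 48 128 -> 128x compression

vectors = 1_000_000_000
print(f"{vectors * raw / 1e12:.1f} TB raw vs {vectors * pq / 1e9:.0f} GB with PQ")
```

Roughly 6 TB of raw float32 vectors shrink to tens of GB, which is the difference between a cluster and a single box; the price is the lossy reconstruction noted above.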
Analogy: Flat search is reading every page of a dictionary. HNSW is using the index—jump to the right section, then narrow down. IVF is like chapters; PQ is compressing the font so you fit more on a page.
Why This Matters in Production: We tuned HNSW at Maersk—higher efConstruction for better graphs, but it slowed writes. For read-heavy RAG, we cranked it; for ingest-heavy, we backed off.
Aha Moment: HNSW wins because it balances speed, accuracy, and ops. You can add vectors without rebuilding. IVF and PQ are for when memory or scale forces trade-offs.
4. Vector Database Comparison
Interview Insight: You'll be asked "Qdrant vs Pinecone vs pgvector—when do you use which?" Have a clear decision framework: ops burden, scale, latency, filtering needs, budget.
You'll hear this question in every RAG interview. Don't waffle. Here's the cheat sheet:
| Database | Best for | Trade-offs |
|---|---|---|
| Qdrant | Fast filtering, hybrid search, self-host or cloud | You manage infra (or pay for cloud) |
| Pinecone | Zero-ops, serverless, rapid prototyping | Expensive at 100M+ vectors |
| Weaviate | Multi-modal, GraphQL, built-in vectorization | Memory-heavy, manual scaling |
| Chroma | Dev, demos, <1M vectors | Not for production scale |
| pgvector | Already on Postgres, moderate scale, SQL familiarity | Slower than dedicated DBs |
| Milvus | Billion-scale, distributed | Complex ops (etcd, MinIO, Pulsar) |
Qdrant applies payload filters during vector search (pre-filtering)—critical for accuracy. Supports sparse+dense hybrid natively. p50 ~4ms in benchmarks. pgvector 0.8.0 fixed overfiltering and boosted filtered query perf. Pinecone namespaces give multi-tenancy in one index.
Analogy: Qdrant is a sports car—fast, tunable, you maintain it. Pinecone is a rental—turn-key, pay per mile. pgvector is the car you already own—add a roof rack (extension) and go.
Why This Matters in Production: We evaluated Qdrant and pgvector for the booking RAG. Went with Qdrant for pre-filtering by user/calendar and hybrid search. pgvector would've worked if we'd standardized on Postgres.
Aha Moment: Pre-filtering vs post-filtering matters. Post-filter: vector search first, then filter—selective filters can leave you with too few results. Pre-filter: only matching vectors enter the search. Qdrant does this; some others don't.
5. Hybrid Search
Interview Insight: Know why hybrid beats pure vector or pure keyword, and how RRF works. Interviewers expect you to say "product IDs, error codes, proper nouns" as examples where keyword wins.
This one trips up teams constantly. Vector search misses exact terms—IDs, codes, names. Ask "MAEU12345678 status" and pure semantic search might return generic shipment docs. BM25 guarantees term matches but is blind to synonyms ("cargo" vs "freight"). Hybrid: run both in parallel, merge with fusion. Best of both worlds.
Reciprocal Rank Fusion (RRF): For each doc, score = Σ 1/(k + rank) across retrievers. k=60 typical. No tuning, robust to different score scales. Supported in Qdrant, Elasticsearch, Milvus, Azure AI Search. Linear combination: α × vector_score + (1−α) × keyword_score. Tune α per corpus—more keyword weight when IDs matter.
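The RRF formula above fits in a few lines; a minimal sketch with hypothetical doc IDs, assuming each retriever returns a list ordered best-first:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Reciprocal Rank Fusion: score(doc) = sum over retrievers of 1/(k + rank).
    Operates on ranks alone, so score scales never need normalizing."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

dense = ["d3", "d1", "d7"]   # semantic hits, best first
sparse = ["d9", "d3", "d1"]  # BM25 hits (exact-term matches), best first
fused = rrf([dense, sparse])
print([doc for doc, _ in fused])  # d3 first: ranked high by both retrievers
```

Note how a doc that appears in both lists (d3) beats the top hit of a single retriever (d9) — agreement across retrievers is rewarded without any score calibration.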
Analogy: Vector search understands "shipment delay" and "late delivery." Keyword search finds "MAEU12345678." Hybrid is using both the thesaurus and the index.
Why This Matters in Production: Booking references, container IDs, vessel names—pure semantic search would miss them. We use Qdrant's sparse+dense with RRF so we catch both "when does my cargo leave?" and "MAEU12345678 status."
Aha Moment: RRF doesn't need normalized scores. It works on rank alone. Add a third retriever? Just include it in the sum. That's why it's the default fusion method.
6. Metadata Filtering
Interview Insight: Multi-tenant isolation, time ranges, access control—all depend on metadata. Know pre- vs post-filtering and how to structure filters (must/should/must_not).
Attach key-value pairs at index time: source, date, department, tenant_id, access_level. Every vector should carry enough context to filter it. Pre-filtering applies the filter before/during vector search—faster, preserves accuracy, and crucially: you never search vectors that don't match. Post-filtering runs vector search first, then filters—risky when filters are selective; you may need to over-retrieve (fetch 1000, filter down to 5).
Qdrant: must, should, must_not with nested conditions. Example: must: [{key: "department", match: {value: "engineering"}}, {key: "date", range: {gte: "2024-01-01"}}].
Analogy: Metadata is like labels on file folders. "Finance, 2024, confidential." Without them, you're searching a pile of unlabeled papers.
Why This Matters in Production: Every Maersk query filters by tenant. Omitting tenant_id = data leak. Pre-filtering ensures we never even look at another customer's vectors.
Aha Moment: If your DB only does post-filtering and your filter is strict, top-k vector results might have zero matches. You'd need to fetch 10k, filter, return 5. Pre-filtering avoids that waste.
7. Multi-Tenancy
Interview Insight: Three patterns: collection-per-tenant, namespace-per-tenant, metadata tenant_id. Know when each fits and the security trade-off of metadata-only.
Three ways to slice it. Collection-per-tenant: One collection per customer. Complete isolation, easy delete, backup per tenant. Doesn't scale to thousands of tenants—management overhead kills you. Namespace-per-tenant: Single collection, namespaces (Pinecone style). Each namespace = one tenant. Good balance for SaaS. Metadata-based: tenant_id on every vector, filter on every query. Simplest to implement; one bug = cross-tenant leak. Use only with middleware that always injects the filter, or for internal tools where the blast radius is small.
Analogy: Collection-per-tenant is separate apartments. Namespace is floors in a building. Metadata is name tags—fine until someone forgets to check the tag.
Why This Matters in Production: For customer-facing SaaS at Maersk, we use namespace or collection isolation. Metadata-only would be a compliance risk.
Aha Moment: Never trust "we always add the filter in the app." One forgotten filter, one misconfigured client, and you've leaked data. Enforce isolation at the DB layer when possible.
8. Scaling Vector Stores
Interview Insight: Sharding, replication, quantization, dimension truncation. Be ready to say "measure first, then tune efSearch, quantize, shard."
Retrieval slow? Don't throw hardware at it. Horizontal scaling: Shard across nodes; broadcast queries, merge results. Qdrant and Milvus support this. Replication: Copies for redundancy and read scaling—more replicas, more concurrent queries. Index vs query trade-offs: Higher efConstruction = better graph, slower writes. Quantization (PQ, scalar) cuts memory 4×+ with some recall loss—we saw ~2% recall drop with scalar quant at Maersk, worth it for the speed. Smaller dims (Matryoshka) reduce storage and latency.
Cost optimization: Truncate to 512 dims, use quantization, tier hot vs cold data. For 100M+ vectors, consider Milvus or Zilliz Cloud.
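Scalar quantization is conceptually just bucketing each float32 into one of 256 int8 levels. A minimal sketch of the round trip, to make the 4× memory saving and the small reconstruction error concrete:

```python
def quantize_int8(vec: list[float]) -> tuple[list[int], float, float]:
    """Scalar quantization: map each float32 to an int8 bucket (4x smaller).
    Returns (codes, lo, scale) so vectors can be approximately reconstructed."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 255 if hi > lo else 1.0
    codes = [round((x - lo) / scale) for x in vec]
    return codes, lo, scale

def dequantize_int8(codes: list[int], lo: float, scale: float) -> list[float]:
    return [lo + c * scale for c in codes]

vec = [0.12, -0.40, 0.88, 0.03]
codes, lo, scale = quantize_int8(vec)
approx = dequantize_int8(codes, lo, scale)
print(max(abs(a - b) for a, b in zip(vec, approx)))  # small per-value error
```

Production engines (Qdrant's scalar quantization, for instance) do this per dimension across the whole collection and often rescore top candidates with the original float32 vectors, which is how most of the recall loss gets clawed back.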
Analogy: Scaling is like a library. One building = single node. Branch libraries = shards. Copies of popular books = replication. Microfilm = quantization—smaller but lower fidelity.
Why This Matters in Production: At 2M vectors we hit latency spikes. Scalar quantization in Qdrant cut memory 4× and sped queries. Sharding by doc_id distributed load.
Aha Moment: Don't guess. Profile: is it embedding gen, ANN search, or post-processing? Fix the bottleneck. Often it's efSearch too low (bad recall) or too high (slow).
Code Examples
Embedding Generation (OpenAI + Sentence-Transformers)
from openai import OpenAI
from sentence_transformers import SentenceTransformer
def embed_openai(texts: list[str], model: str = "text-embedding-3-small", dims: int | None = None):
"""OpenAI with optional Matryoshka truncation."""
client = OpenAI()
kwargs = {"model": model, "input": texts}
if dims:
kwargs["dimensions"] = dims # 256, 512, 1024
response = client.embeddings.create(**kwargs)
return [d.embedding for d in response.data]
def embed_sentence_transformers(texts: list[str], model_name: str = "BAAI/bge-large-en-v1.5"):
    """Local open-source embeddings via sentence-transformers."""
    model = SentenceTransformer(model_name)  # renamed arg to avoid shadowing
    return model.encode(texts, convert_to_numpy=True).tolist()
# Usage
docs = ["Vector databases store embeddings.", "Embeddings are dense representations."]
vecs = embed_openai(docs, dims=512)  # Truncate for cost savings
Qdrant: Create, Insert, Search with Filters
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct, Filter, FieldCondition, MatchValue, HnswConfigDiff,
)
client = QdrantClient(url="http://localhost:6333")
client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    hnsw_config=HnswConfigDiff(m=16, ef_construction=200),
)
points = [
PointStruct(
id=1,
vector=[0.1] * 1536, # Replace with real embedding
payload={"text": "Document about RAG", "source": "blog", "department": "engineering"},
),
]
client.upsert(collection_name="docs", points=points)
results = client.search(
collection_name="docs",
query_vector=[0.15] * 1536,
query_filter=Filter(must=[FieldCondition(key="department", match=MatchValue(value="engineering"))]),
limit=5,
)
pgvector: Create Table, Search with Cosine
import psycopg2
from pgvector.psycopg2 import register_vector
conn = psycopg2.connect("dbname=vectordb user=postgres")
register_vector(conn)
cur = conn.cursor()
cur.execute("""
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
content TEXT,
embedding vector(1536)
);
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
""")
conn.commit()
# Search: <-> is L2 distance; <=> is cosine distance, so 1 - (embedding <=> q) = cosine similarity
# query_vec: the query's embedding (same model and dims used at index time)
cur.execute(
"SELECT id, content, 1 - (embedding <=> %s::vector) AS sim FROM documents ORDER BY embedding <=> %s::vector LIMIT 5",
(query_vec, query_vec),
)
Hybrid Search with Qdrant (Sparse + Dense, RRF)
from qdrant_client import models
# dense_vec: list[float]; sparse_vec: models.SparseVector(indices=..., values=...)
# Prefetch candidates from both named vectors, then fuse with RRF
results = client.query_points(
    collection_name="hybrid_docs",
    prefetch=[
        models.Prefetch(query=dense_vec, using="dense", limit=50),
        models.Prefetch(query=sparse_vec, using="sparse", limit=50),
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),
    limit=10,
)
Mermaid Diagrams
HNSW Hierarchical Graph
graph TB
subgraph layer2["Layer 2 - Express Lane"]
nodeA2[nodeA] --- nodeB2[nodeB]
nodeA2 --- nodeC2[nodeC]
end
subgraph layer1["Layer 1"]
nodeA1[nodeA] --- nodeB1[nodeB]
nodeA1 --- nodeD1[nodeD]
nodeB1 --- nodeD1
end
subgraph layer0["Layer 0 - All Vectors"]
nodeA0[nodeA] --- nodeB0[nodeB]
nodeA0 --- nodeD0[nodeD]
nodeA0 --- nodeE0[nodeE]
nodeB0 --- nodeD0
end
nodeA2 --> nodeA1
nodeA1 --> nodeA0
nodeB2 --> nodeB1
nodeB1 --> nodeB0
Search: start at top, greedily find nearest, descend, refine.
Embedding Pipeline
flowchart LR
txt["Raw Text"] --> tok["Tokenize"]
tok --> fwd["Model Forward"]
fwd --> pool["Mean Pool"]
pool --> vec["Vector"]
Vector DB Query Flow
flowchart LR
q["User Query"] --> emb["Embedding Model"]
emb --> ann["ANN Search"]
ann --> flt["Metadata Filter"]
flt --> out["Top-K Results"]
Hybrid Search Architecture
flowchart TB
q["Query"] --> bm25["BM25"]
q --> vec["Vector Search"]
bm25 --> rrf["RRF Fusion"]
vec --> rrf
rrf --> top["Top-K"]
Pre-filter vs Post-filter
flowchart LR
subgraph pre["Pre-filter"]
f1["Filter First"] --> v1["Vector Search on Subset"]
end
subgraph post["Post-filter"]
v2["Vector Search All"] --> f2["Filter After"]
end
Conversational Interview Q&A
Q1: How would you choose between Qdrant and pgvector for a RAG system?
Weak answer: "Qdrant is faster. pgvector is easier if you use Postgres." (Too vague; no decision criteria.)
Strong answer: "If we already run Postgres and have moderate scale—under 1M vectors, maybe up to 50M with tuning—pgvector keeps the stack simple. One DB, SQL, transactions. At Maersk we considered pgvector for the booking RAG but went with Qdrant because we needed pre-filtering during vector search—filter by user and calendar before the ANN, not after. pgvector's filtering used to overfilter; 0.8.0 helped. For latency-critical or hybrid sparse-dense, Qdrant wins. For 'good enough' with minimal ops, pgvector."
Q2: Explain HNSW to a non-expert. Why is it default in most vector DBs?
Weak answer: "It's a graph algorithm. It's fast." (No structure, no trade-offs.)
Strong answer: "Think of a multi-story building. Ground floor has every vector, each connected to its nearest neighbors. Upper floors have fewer 'hub' vectors with long-range links. To find similar vectors, you start at the top, hop to the nearest hub, go down a floor, refine, repeat. You skip most of the building. That's O(log n) instead of checking everyone. It's default because it nails the trade-off: high recall, sub-ms latency, incremental updates. IVF is faster to build but worse at boundaries. PQ saves memory but loses accuracy. HNSW is the sweet spot for most workloads—including ours at Maersk for the knowledge base."
Q3: How do you handle multi-tenancy in a vector database?
Weak answer: "We add tenant_id to metadata and filter." (Security risk if filter is ever omitted.)
Strong answer: "Three options. Collection-per-tenant: complete isolation, but doesn't scale to thousands. Namespace-per-tenant: Pinecone-style, one namespace per customer, good balance. Metadata-based: tenant_id on every vector, filter every query—simplest but one bug and you leak data. At Maersk for customer-facing apps we use namespace or collection isolation. Metadata-only is for internal tools where we have middleware that always injects the filter. Never rely on 'we always add it in the app' for sensitive data."
Q4: Your retrieval is slow. How do you debug?
Weak answer: "Add more servers. Use a faster DB." (No diagnosis.)
Strong answer: "Measure first. Is it embedding generation, ANN search, or post-processing? Profile the request. If it's ANN: tune efSearch—lower speeds up but hurts recall, higher helps recall but costs latency. Check quantization: scalar or PQ can cut memory 4× and speed search. If we're at 100M+ vectors, sharding and replication. At Maersk we hit latency spikes at 2M vectors—enabled scalar quantization in Qdrant, 4× memory drop, faster queries. Also: are we over-retrieving? Top-100 when we only need top-10? Smaller embedding dims via Matryoshka can help. Fix the bottleneck, not the whole stack."
Q5: How do you evaluate embedding quality?
Weak answer: "We use MTEB." (Which scores? For what task?)
Strong answer: "MTEB is the standard—56+ tasks, retrieval, classification, reranking, etc. For RAG I care about retrieval-specific scores, not overall average. A model great at classification can underperform on retrieval. We run MTEB retrieval tasks and a small domain eval: 50–200 (query, relevant_doc) pairs from our corpus, measure MRR or NDCG@10. At Maersk we compared OpenAI small, Cohere, and BGE on our booking docs—OpenAI won on our eval, so we went with that. Cost matters: 2% better retrieval at 5× cost might not be worth it."
Q6: What's Matryoshka truncation and when would you use it?
Weak answer: "You can make embeddings shorter." (No mechanism, no use case.)
Strong answer: "Matryoshka Representation Learning trains embeddings so the first k dimensions hold most of the semantic signal. You truncate—256, 512, 1024—without retraining. OpenAI's text-embedding-3 models support it via the dimensions parameter. A 256-dim truncation of 3-large can beat full ada-002. Why use it? Cost: smaller vectors = less storage, faster indexing, faster search. For large corpora that adds up. We could index with 512 dims for fast retrieval, then re-embed top candidates at full dims for reranking if needed."
Q7: How do you upgrade embedding models when all existing vectors are from the old model?
Weak answer: "Re-embed everything." (Correct but incomplete.)
Strong answer: "Vectors from different models are incomparable—their spaces don't align. So we have to re-embed. Full re-embed and index rebuild. For millions of docs that's hours or days. At Maersk we'd run a parallel index with the new model, route a slice of traffic (10%, 50%), compare quality and latency, then cut over. Treat it as a migration: batch re-embed, validate on holdout set, plan cutover. Don't mix old and new in the same index—results would be wrong."
From Your Experience (Maersk)
1. What vector store did you use for the email booking RAG, and why?
We used Qdrant. Needed fast metadata filtering by user_id and calendar_id—pre-filtering during search was non-negotiable so we never risk returning another user's data. pgvector was an option since we already had Postgres elsewhere, but we didn't want to add vector workload to that cluster. Qdrant's hybrid sparse+dense and RRF fusion helped us catch both semantic queries ("when does my shipment leave?") and exact references (container IDs, booking numbers). The Rust-based performance and Python client made integration straightforward.
2. How did you handle embedding generation for the knowledge base?
OpenAI text-embedding-3-small with batches of 100 docs to stay within limits. We used the dimensions parameter set to 512 for cost savings—our eval showed minimal recall loss. Query-side embeddings at request time, with caching for repeated queries. We evaluated BGE-large locally; OpenAI won on our domain test set and we preferred the managed API for ops simplicity.
3. Did you face scaling issues with vector search?
Yes. At ~2M vectors we saw latency spikes during peak. We raised efSearch from 64 to 128 (helped recall, added latency), then enabled scalar quantization in Qdrant. Memory dropped 4×, query speed improved. We also sharded the collection by hash of document_id to spread load. That got p99 under 50ms.
Quick Fire Round
- Cosine vs L2 for embeddings? For unit-normalized vectors they give the same ranking; cosine is the usual default. Use whatever your model was trained with.
- What is efSearch? HNSW query-time parameter: size of the candidate list explored during search. Higher = better recall, slower.
- Pre-filter vs post-filter? Pre: filter before/during search. Post: search first, filter after. Pre is better when filters are selective.
- RRF formula? Score = Σ 1/(k + rank). k≈60.
- When is flat search OK? <100K vectors, latency acceptable.
- What does M mean in HNSW? Max connections per node. Higher = denser graph, better recall, more memory.
- Cohere input_type? "search_document" for indexing, "search_query" for querying.
- Matryoshka truncation? Use first k dimensions of embedding; earlier dims have more signal.
- pgvector 0.8.0 improvement? Iterative scans fix overfiltering in hybrid/filtered queries.
- IVF nprobe? Number of clusters to search at query time. Higher = better recall, slower.
- Why hybrid over pure vector? Exact terms—IDs, codes, names—that vector search misses.
- MTEB retrieval vs overall? Retrieval-specific scores matter for RAG; overall can mislead.
- Namespace vs metadata tenant? Namespace = DB-enforced scope. Metadata = filter in query; one bug = leak.
- When Milvus over Qdrant? Billion-scale, distributed, dedicated infra team.
- Product Quantization trade-off? 10–20% recall loss for 10–100× memory reduction.
Key Takeaways
| Topic | Key Point |
|---|---|
| Embedding models | Match to use case; MTEB retrieval scores; Matryoshka for cost |
| How embeddings work | Contrastive learning; incompatible across models; re-embed on upgrade |
| Indexing | HNSW default; flat for small; IVF/PQ for scale/memory |
| DB choice | Qdrant: performance, filtering; Pinecone: zero-ops; pgvector: already Postgres; Milvus: billion-scale |
| Hybrid search | BM25 + vector, RRF fusion; catches exact terms + semantic |
| Metadata | Pre-filter when possible; must/should/must_not |
| Multi-tenancy | Namespace or collection for sensitive; metadata only with safeguards |
| Scaling | Measure; quantize; shard; tune efSearch; smaller dims |
Bottom Line
If you take one thing from this session: retrieval is your RAG bottleneck more often than the LLM. Get your embedding model right, pick a vector DB that supports pre-filtering and hybrid search if you need it, and tune HNSW for your workload. HNSW is the default index—know why. Pre-filtering beats post-filtering when you have selective metadata (tenant, date, access). Hybrid search catches both semantic queries and exact IDs—booking references, container numbers, error codes. And always, always re-embed when you change embedding models; mixing spaces gives you nonsense.
At Maersk we learned this the hard way: when the booking RAG started missing queries, it wasn't the prompts. It was the vectors. Fix the foundation first. Everything else is polish.
Further Reading
- MTEB Leaderboard (huggingface.co/spaces/mteb/leaderboard) — Benchmark embedding models; check retrieval-specific scores.
- Qdrant Docs (qdrant.tech/documentation) — Hybrid search, filtering, quantization.
- HNSW Paper (Malkov & Yashunin, 2018) — "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs."
- Matryoshka Representation Learning (Kusupati et al., 2022) — Dimension truncation for embeddings.
- pgvector (github.com/pgvector/pgvector) — PostgreSQL extension.
- Sentence-Transformers (sbert.net) — Open-source embeddings.
- Cohere Embed v3 (docs.cohere.com) — input_type, multilingual, compression.