AI Engineer Prep

Session 10: Evaluations — Online & Offline

You wouldn't ship code without tests. So why are teams shipping AI systems with nothing but vibes? A prompt change that "looks better" in a quick manual check can silently break half your edge cases. A retriever tweak might boost recall in dev while dumping irrelevant noise into production. The only way to know is to measure—and most teams aren't measuring.

Evaluations are to AI systems what unit tests are to traditional software. They define correctness, catch regressions, and let you iterate with confidence instead of crossing your fingers. At Maersk, our Enterprise AI Agent Platform runs evals on every PR; if faithfulness or answer relevancy drops, the merge is blocked. No exceptions. This session is your crash course on offline evals, online evals, RAG metrics, evaluation frameworks, LLM-as-judge, and the eval-driven development loop. By the end, you'll know how to measure AI quality, build golden datasets, and ship with confidence.


1. Why Evaluations Matter

LLMs are fundamentally non-deterministic. Same input, different output—thanks to temperature, sampling, and inherent randomness. You can't treat "it worked when I tried it" as a quality gate. Evaluation-driven development (EDD) flips the workflow: you define what "good" looks like, build evals first (or very early), then iterate until they pass. EDDOps (Evaluation-Driven Development and Operations), formalized in 2025 research, embeds evaluation as a continuous function throughout the LLM lifecycle—not a one-time checkpoint before deploy.

Interview Insight: "How do you ensure quality in LLM systems?"—They want to hear that you treat evals as first-class tests, not an afterthought. Mention golden datasets, CI gates, and production monitoring.

Analogy: Evals are like a flight checklist. Pilots don't skip it because "the plane felt fine last time." You don't skip evals because "the demo looked good." Both are safety-critical.

Why This Matters in Production: Without evals, you're flying blind. A 15% quality drop can go unnoticed until support gets flooded. With evals, you catch regressions before they reach users.

Aha Moment: Evals aren't optional. They're the specification. What you measure is what you optimize for—so choose wisely.


2. Offline Evaluation (Pre-Deployment)

Offline evaluation runs before deployment on curated datasets in a controlled environment. It answers: "Does this system meet our quality bar before we expose it to users?"

Golden datasets are the foundation. Curated input-output pairs with expected answers. Build them from: (1) manual annotation (experts label), (2) production examples (curate good/bad from real traffic), (3) synthetic generation (LLM creates test cases from seeds). Size: 20–50 for initial dev, 100–200 for production confidence, 500+ for fine-tuning. Version them with code—they're part of your spec.

Benchmark suites give standardized tests for extraction, reasoning, formatting. DeepEval supports MMLU, HumanEval, GSM8K, TruthfulQA, DROP, HellaSwag, BIG-Bench Hard.

Unit tests for LLMs assert on structure (valid JSON?), content (required fields?), factual accuracy (correct given context?), safety (no PII, no harm). Often probabilistic—e.g., faithfulness ≥ 0.9 instead of exact string match.
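The deterministic structural checks can be sketched in a few lines; `REQUIRED_FIELDS` is an illustrative schema, not a fixed standard:

```python
import json

REQUIRED_FIELDS = {"title", "date", "attendees"}  # illustrative schema

def check_structure(raw_output: str) -> list[str]:
    """Return a list of structural failures (empty list = pass)."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    failures = []
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        failures.append(f"missing fields: {sorted(missing)}")
    return failures

# Deterministic checks run before any probabilistic metric:
assert check_structure('{"title": "Sync", "date": "2025-01-15", "attendees": []}') == []
assert check_structure("not json") == ["output is not valid JSON"]
```

Run these cheap checks first; only outputs that pass structure go on to the expensive, probabilistic metrics.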

Regression tests ensure changes don't break behavior. Run full eval suite on every PR. Block merge if scores drop.

Evaluation in CI/CD automates this. DeepEval + pytest + GitHub Actions. Set thresholds (faithfulness ≥ 0.85, answer relevancy ≥ 0.8), block merges that fail.

Interview Insight: "How do you run evals in CI?"—Describe a parametrized pytest over a golden dataset, thresholds, and merge blocks. Show you've done it.

Analogy: Offline evals are like a driving test. You prove competence in a controlled environment before you hit the highway.

Why This Matters in Production: CI evals catch regressions before they ship. At Maersk, our email booking agent has ~150 golden examples; every PR runs through them. We've blocked several "improvements" that would have broken edge cases.

Aha Moment: Golden datasets are living documents. Every production failure that matters becomes a test case. Over time, your eval suite encodes your entire failure history.


3. Online Evaluation (Post-Deployment)

Online evaluation runs in production on real user interactions. It answers: "Is the system performing well for actual users?"

A/B testing splits traffic between two prompt or model versions. Measure thumbs-up rate, task completion, or custom quality scores. Tools like Langfuse and Braintrust support prompt A/B tests with built-in metrics.

Human feedback—thumbs up/down, corrections, escalations. Aggregate to find patterns ("users downvote refund policy answers") and expand golden datasets.

Production monitoring tracks quality over time. Alert when hallucination rate, PII leak rate, or LLM-as-judge scores drop below threshold.

Shadow testing runs the new model alongside the old on the same requests—compare outputs without serving to users.

Canary deployments serve new version to 5% of traffic, monitor, ramp up if metrics hold. Limits blast radius.
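A staged canary ramp can be sketched as a simple state machine; the stages and the kill-switch behavior are illustrative choices, not a specific platform's API:

```python
# Hypothetical staged canary: advance only while quality holds.
RAMP_STAGES = [0.05, 0.25, 0.50, 1.0]  # fraction of traffic on the new version

def next_traffic_fraction(current: float, quality_ok: bool) -> float:
    """Advance to the next stage if metrics held; roll back to 0 otherwise."""
    if not quality_ok:
        return 0.0  # kill switch: route everything back to the old version
    later = [s for s in RAMP_STAGES if s > current]
    return later[0] if later else current

assert next_traffic_fraction(0.05, quality_ok=True) == 0.25
assert next_traffic_fraction(0.25, quality_ok=False) == 0.0
```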

Interview Insight: "What's the difference between online and offline evals?"—Offline = before deploy, curated data, deterministic. Online = after deploy, real traffic, feedback loops. You need both.

Analogy: Offline evals are the driving test; online evals are the GPS tracking your real-world driving. One proves you can drive; the other proves you're not crashing on the road.

Why This Matters in Production: Distribution shift happens. Edge cases you never seeded in golden data show up. Online evals catch them; production failures feed back into the dataset.

Aha Moment: The eval flywheel: production failures → add to golden dataset → improve system → verify with evals. Failures become tests. The system gets smarter.


4. RAG-Specific Metrics

RAG pipelines have two components—retriever and generator. Each has distinct failure modes. Evaluate them separately.

Faithfulness: Is the answer grounded in the retrieved context? Detects hallucinations. Formula: (claims supported by context) / (total claims in answer). Extract claims via LLM, check each against context. Score 0–1. DeepEval and RAGAS implement this; RAGAS can use Vectara's HHEM-2.1-Open (small T5 classifier) for efficient production use without an LLM judge.
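To make the arithmetic concrete, here's a minimal sketch of the scoring step, assuming an LLM judge has already produced a supported/unsupported verdict per extracted claim (claim extraction is a separate LLM call, omitted here):

```python
def faithfulness(claim_verdicts: list[bool]) -> float:
    """(claims supported by context) / (total claims in answer)."""
    if not claim_verdicts:
        return 0.0
    return sum(claim_verdicts) / len(claim_verdicts)

# Answer with 3 claims, one unsupported ("60-day refund" when context says 30):
assert abs(faithfulness([True, True, False]) - 2 / 3) < 1e-9
assert faithfulness([True, True, True]) == 1.0
```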

Answer Relevancy: Does the answer address the question? RAGAS: generate "reverse questions" from the answer, compute cosine similarity to original question. High similarity = relevant.
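The similarity step can be sketched with plain cosine math; the toy 2-d vectors stand in for real embeddings of the original question and the LLM-generated reverse questions:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def answer_relevancy(question_vec, reverse_question_vecs) -> float:
    """Mean cosine similarity between the original question's embedding
    and embeddings of questions regenerated from the answer."""
    sims = [cosine(question_vec, rv) for rv in reverse_question_vecs]
    return sum(sims) / len(sims)

# Reverse questions that point back at the original question score high:
assert answer_relevancy([1.0, 0.0], [[1.0, 0.0], [0.9, 0.1]]) > 0.95
```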

Context Precision: Are retrieved chunks relevant and correctly ranked? RAGAS uses Average Precision over binary "useful" verdicts per chunk. For rank i: P@i = (sum of useful verdicts up to i) / i. AP = sum of (P@i × v_i) / (total useful + ε). Ordering matters—an irrelevant chunk at rank 1 tanks the score; useful chunks should be ranked first.
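The Average Precision computation follows directly from the formula; the verdict lists below are illustrative:

```python
def context_precision(verdicts: list[int], eps: float = 1e-10) -> float:
    """RAGAS-style Average Precision over binary per-chunk 'useful' verdicts."""
    numer, hits = 0.0, 0
    for i, v in enumerate(verdicts, start=1):
        hits += v
        numer += (hits / i) * v  # P@i counted only at useful ranks
    return numer / (sum(verdicts) + eps)

# Same chunks, different order: ranking the useful ones first scores higher.
good = context_precision([1, 1, 0])  # useful chunks at ranks 1-2
bad = context_precision([0, 1, 1])   # irrelevant chunk at rank 1
assert good > bad
```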

Context Recall: Did retrieval find all relevant information? Proportion of ground-truth statements attributable to the retrieved context. RAGAS: (1/m) × Σ a_k, where m = statements in the reference and a_k = 1 if statement k is attributable to the context. Requires ground truth; RAGAS's LLMContextRecall needs only a reference answer, not annotated reference contexts.

Groundedness (DeepEval): Similar to faithfulness but uses entailment scoring per claim.

Interview Insight: "How do you evaluate a RAG system?"—Retriever: context precision, recall, relevancy. Generator: faithfulness, answer relevancy. Run both on golden dataset, set thresholds, block deploys that fail.

Analogy: Faithfulness is like a fact-checker: "Can you trace every claim back to a source?" Answer relevancy is like an editor: "Did you actually answer the question?"

Why This Matters in Production: At Maersk, our RAG pipelines power support and internal knowledge search. A single hallucination about shipping rates or refunds is costly. Faithfulness ≥ 0.85 is non-negotiable.

Aha Moment: One unsupported claim tanks faithfulness. "You can get a refund within 60 days" when context says 30—that's a 0.0 on that claim. Precision matters.


5. General LLM Metrics & Evaluation Frameworks

Beyond RAG: correctness (factually right?), completeness (covers all aspects?), conciseness (not verbose?), toxicity (no harm?), bias (fair?), latency (P50, P95, P99), cost (tokens, dollars). Combine into composite gates: "Ship only if faithfulness ≥ 0.85 AND answer relevancy ≥ 0.8 AND P95 < 3s."
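A composite gate like the one quoted is just a conjunction of thresholds; the metric names and values below mirror the example in the text:

```python
# Hypothetical composite ship gate combining quality and latency thresholds.
def ship_gate(metrics: dict) -> bool:
    """Ship only if every condition in the composite gate holds."""
    return (
        metrics["faithfulness"] >= 0.85
        and metrics["answer_relevancy"] >= 0.8
        and metrics["p95_latency_s"] < 3.0
    )

assert ship_gate({"faithfulness": 0.9, "answer_relevancy": 0.85, "p95_latency_s": 2.1})
assert not ship_gate({"faithfulness": 0.9, "answer_relevancy": 0.85, "p95_latency_s": 4.0})
```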

DeepEval: 50+ metrics, pytest-equivalent for LLMs, runs locally, CI/CD. RAG (faithfulness, relevancy, context precision/recall), agentic (task completion, tool correctness), benchmarks (MMLU, GSM8K). Best for comprehensive assertion-style evals and CI.

RAGAS: Reference-free RAG evaluation. Faithfulness, answer relevancy, context precision, context recall. Collections-based API (2025): Faithfulness, ContextPrecision, ContextRecall with ascore() / score(). Best for RAG-specific metrics.

LangSmith: LangChain's observability + evaluation. Datasets, annotation queues, eval runs, tracing. Best when your stack is LangChain/LangGraph.

Phoenix (Arize): Open-source observability, built-in RAG evaluators, trace visualization, retrieval quality monitoring. Good for production debugging.

Choosing: DeepEval for CI/CD + comprehensive evals; RAGAS for RAG metrics; LangSmith for LangChain tracing + datasets. Many teams combine: LangSmith for tracing, RAGAS for RAG metrics, DeepEval for assertions.

Interview Insight: "Compare DeepEval, RAGAS, and LangSmith"—DeepEval = full eval framework, CI, assertions. RAGAS = RAG-specific, reference-free. LangSmith = tracing + datasets, LangChain-native.

Analogy: DeepEval is a full gym (weights, cardio, classes). RAGAS is the running track (RAG-specific). LangSmith is the member app that logs everything (tracing + evals).

Why This Matters in Production: You don't pick one. Production setups often use LangSmith for trace + dataset management, RAGAS for retrieval metrics, DeepEval for assertion tests in CI.

Aha Moment: Frameworks are complementary. The principles—golden datasets, component metrics, CI gates, production monitoring—transcend tool choice.


6. LLM-as-Judge

What it is: Use a strong LLM (GPT-4, Claude) to evaluate another LLM's output. Judge sees question, answer, criteria; returns score and optionally reasoning.

How: Prompt with "Rate correctness 1–5, explain." Use structured JSON output. RAGAS and DeepEval use schema-constrained prompts + Pydantic + retries for malformed outputs.

Benefits: Scalable (no humans per example), handles subjective quality (helpfulness, tone), no ground truth needed.

Limitations: The judge has its own biases, is expensive (every eval is an LLM call), and can be gamed. Position bias: judges often prefer the first option in pairwise comparisons. Rating indeterminacy (2025): forced-choice ratings can mislead when multiple valid answers exist—up to 31% worse than multi-label ratings. JudgeBench: GPT-4o barely beats random on hard tasks.

Mitigations: Detailed rubrics. Multiple judges, aggregate. Swap positions in pairwise. Calibrate against humans. Use multi-label "response set" instead of forced-choice. Use mean of judgment distributions over greedy (mode)—outperforms in 42 of 48 cases. Avoid chain-of-thought for judges—hurts performance up to 6.5%.
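The position-swap mitigation can be sketched as follows; `judge` is a hypothetical callable returning "first" or "second" for a pair of answers:

```python
# Position-bias mitigation sketch: judge each pair twice with the order
# swapped, and only accept a verdict the judge gives in both orderings.
def debias_pairwise(judge, answer_a: str, answer_b: str) -> str:
    v1 = judge(answer_a, answer_b)  # A shown first
    v2 = judge(answer_b, answer_a)  # B shown first
    if v1 == "first" and v2 == "second":
        return "A"
    if v1 == "second" and v2 == "first":
        return "B"
    return "tie"  # inconsistent verdicts: position bias suspected

# A judge that always prefers whatever is shown first is exposed as a tie:
biased_judge = lambda first, second: "first"
assert debias_pairwise(biased_judge, "answer A", "answer B") == "tie"
```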

Interview Insight: "Explain LLM-as-judge limitations"—Bias, cost, gameability, position bias, rating indeterminacy. Mitigate with rubrics, calibration, multi-label ratings, mean over mode.

Analogy: LLM-as-judge is like a substitute teacher grading essays. Helpful at scale, but has blind spots. You still need the real teacher (human eval) for hard cases.

Why This Matters in Production: Use for subjective quality (tone, helpfulness). Don't use when ground truth exists—prefer rule-based or exact match. At Maersk we use LLM-as-judge for a sample of traffic; we calibrate quarterly against human ratings.

Aha Moment: Judge competence correlates with grading accuracy. If the judge can't answer the question, it can't reliably grade the answer. Match judge strength to task difficulty.


7. Evaluation-Driven Development & Golden Datasets

Workflow: Define success → build golden dataset → implement → run evals → iterate (prompt, retrieval, model) → ship when passing → monitor. Evals are the specification.

Version control: Track dataset, metrics, results with code. Each run tied to dataset, prompt, model versions. EDDOps unifies offline + online in a closed feedback loop.
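A minimal lineage record might look like this; the field names are illustrative, not any framework's schema:

```python
import hashlib
from datetime import datetime, timezone

def run_lineage(dataset_path: str, prompt_version: str, model: str, scores: dict) -> dict:
    """Tie one eval run to the exact dataset contents, prompt, and model."""
    with open(dataset_path, "rb") as f:
        dataset_hash = hashlib.sha256(f.read()).hexdigest()[:12]
    return {
        "dataset_sha256": dataset_hash,  # changes whenever the golden set changes
        "prompt_version": prompt_version,
        "model": model,
        "scores": scores,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
```

Store one such record per eval run alongside the results; any score can then be traced back to the exact dataset, prompt, and model that produced it.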

Regression gates: Evals = promotion criteria. Don't promote if they fail.

Golden dataset sources: Production logs (curate good/bad), expert annotation, synthetic generation.

Diversity: Happy path, edge cases, adversarial, different languages. Include past failure modes.

Size: 20–50 dev, 100–200 production, 500+ fine-tuning.

Quality over quantity: 50–100 well-curated examples covering critical paths > 500 noisy ones. "Must-pass" examples—embarrassing to fail—always in suite.

Active learning: Prioritize annotating where the system is uncertain.
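One cheap uncertainty signal is disagreement across repeated judge runs; a sketch, with made-up scores:

```python
import statistics

def most_uncertain(examples: dict[str, list[float]], k: int = 2) -> list[str]:
    """Pick the k inputs whose repeated judge scores disagree the most
    (highest variance); these are worth annotating first."""
    ranked = sorted(examples, key=lambda x: statistics.pvariance(examples[x]), reverse=True)
    return ranked[:k]

scores = {
    "refund policy?": [0.9, 0.2, 0.6],    # judges disagree: annotate first
    "opening hours?": [0.95, 0.9, 0.92],  # stable: lower priority
}
assert most_uncertain(scores, k=1) == ["refund policy?"]
```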

Interview Insight: "How do you build golden datasets?"—Production logs + expert annotation + synthetic. Diversity, version with code, active learning for hard cases.

Analogy: Golden datasets are like a driving instructor's checklist. It grows every time you hit a new scenario. Eventually it covers the full space of "things that could go wrong."

Why This Matters in Production: At Maersk we started with 20 manual examples, then pulled 100 from production where users gave thumbs up. Added 30 adversarial from support tickets. Now ~150—versioned, with lineage for every run.

Aha Moment: Failures become tests. Every time you fix a bug, add it to the golden set. That bug can never slip through again.


Architecture Diagrams

Offline + Online Evaluation Flow

flowchart TB
    subgraph Offline["Offline (Pre-Deploy)"]
        goldenData[Golden Dataset]
        evalSuite[Eval Suite]
        ciPipeline[CI Pipeline]
        goldenData --> evalSuite
        evalSuite -->|Pass| ciPipeline
        ciPipeline -->|Merge| deploy
    end
 
    subgraph Online["Online (Post-Deploy)"]
        traffic[Production Traffic]
        abTest[A/B Test]
        humanFeedback[Human Feedback]
        monitoring[Monitoring]
        traffic --> abTest
        traffic --> humanFeedback
        abTest --> monitoring
        humanFeedback --> monitoring
        monitoring -.->|Failures| goldenData
    end
 
    deploy[Deploy] --> traffic

Evaluation-Driven Development Cycle

flowchart LR
    defineSuccess[Define Success] --> buildGolden[Build Golden]
    buildGolden --> implement[Implement]
    implement --> runEvals[Run Evals]
    runEvals --> passGate{Eval Pass?}
    passGate -->|No| iterate[Iterate]
    iterate --> implement
    passGate -->|Yes| ship[Ship]
    ship --> monitor[Monitor]
    monitor --> failures[Failures]
    failures --> buildGolden

CI/CD with Eval Gates

flowchart TB
    pr[Pull Request] --> build[Build]
    build --> unit[Unit Tests]
    unit --> llmEvals[LLM Evals]
    llmEvals --> threshold{Score >= Threshold?}
    threshold -->|No| block[Block Merge]
    threshold -->|Yes| merge[Merge]
    merge --> staging[Deploy Staging]
    staging --> canary[Canary Prod]

RAG Evaluation Components

flowchart LR
    query[User Query] --> retriever[Retriever]
    retriever --> chunks[Retrieved Chunks]
    chunks --> generator[Generator]
    generator --> answer[Answer]
    chunks --> contextPrecision[Context Precision]
    chunks --> contextRecall[Context Recall]
    answer --> faithfulness[Faithfulness]
    answer --> relevancy[Answer Relevancy]

Code Examples

DeepEval: Faithfulness and Answer Relevancy

from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric
 
actual_output = "We offer a 30-day full refund at no extra cost."
retrieval_context = ["All customers are eligible for a 30 day full refund at no extra cost."]
 
metric_faith = FaithfulnessMetric(threshold=0.7, model="gpt-4.1", include_reason=True)
metric_relev = AnswerRelevancyMetric(threshold=0.7)
 
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output=actual_output,
    retrieval_context=retrieval_context,
)
 
assert_test(test_case, [metric_faith, metric_relev])

RAGAS: Collections-Based Evaluation

from openai import AsyncOpenAI
from ragas import evaluate
from ragas.dataset_schema import SingleTurnSample, EvaluationDataset
from ragas.llms import llm_factory
from ragas.metrics.collections import Faithfulness, ContextPrecision, ContextRecall
 
client = AsyncOpenAI()
llm = llm_factory("gpt-4o-mini", client=client)
 
samples = []
for query, reference in zip(sample_queries, expected_responses):
    relevant_docs = rag.get_most_relevant_docs(query)     # retriever under test
    response = rag.generate_answer(query, relevant_docs)  # generator under test
    samples.append(SingleTurnSample(
        user_input=query,
        retrieved_contexts=relevant_docs,
        response=response,
        reference=reference,
    ))
 
result = evaluate(
    dataset=EvaluationDataset(samples=samples),  # evaluate expects a dataset, not a raw list
    metrics=[Faithfulness(llm=llm), ContextPrecision(llm=llm), ContextRecall(llm=llm)],
)
print(result)  # per-metric scores: faithfulness, context_precision, context_recall

RAGAS: Single Sample Faithfulness

import asyncio
 
from openai import AsyncOpenAI
from ragas.llms import llm_factory
from ragas.metrics.collections import Faithfulness
 
async def main():
    client = AsyncOpenAI()
    llm = llm_factory("gpt-4o-mini", client=client)
    scorer = Faithfulness(llm=llm)
    # ascore is async, so it must run inside an event loop
    result = await scorer.ascore(
        user_input="When was the first super bowl?",
        response="The first superbowl was held on Jan 15, 1967",
        retrieved_contexts=[
            "The First AFL–NFL World Championship Game was played on January 15, 1967 at the Los Angeles Memorial Coliseum."
        ],
    )
    print(f"Faithfulness: {result.value}")  # 1.0
 
asyncio.run(main())

LLM-as-Judge with Structured Output

import json
from openai import OpenAI
 
def llm_as_judge(question: str, answer: str, criteria: str) -> tuple[float, str]:
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",  # JSON mode requires a model that supports response_format
        messages=[
            {"role": "system", "content": "You are an expert evaluator. Rate 1-5 and explain."},
            {"role": "user", "content": f"Question: {question}\n\nAnswer: {answer}\n\nCriteria: {criteria}\n\nRespond in JSON: {{\"score\": <1-5>, \"reasoning\": \"...\"}}"}
        ],
        response_format={"type": "json_object"},
    )
    data = json.loads(response.choices[0].message.content)
    return float(data["score"]), data["reasoning"]

Golden Dataset from Production Logs

import pandas as pd
 
def build_golden_from_logs(logs_path: str, output_path: str, min_thumbs_up: int = 10):
    df = pd.read_csv(logs_path)
    # Keep only responses users endorsed and never corrected
    good = df[(df["feedback"] == "thumbs_up") & (~df["corrected"])]
    # One row per input: count endorsements, keep the first endorsed output
    aggregated = good.groupby("input").agg({"feedback": "count", "output": "first"}).reset_index()
    aggregated = aggregated[aggregated["feedback"] >= min_thumbs_up]
    golden = aggregated[["input", "output"]].rename(columns={"output": "expected_output"})
    golden.to_csv(output_path, index=False)
    return golden

Pytest + CI/CD

import pytest
from deepeval import assert_test
from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset
 
dataset = EvaluationDataset()
dataset.pull(alias="RAG Evals")
 
@pytest.mark.parametrize("golden", dataset.goldens)
def test_rag_pipeline(golden):
    response, chunks = your_rag_app(golden.input)
    test_case = LLMTestCase(
        input=golden.input,
        actual_output=response,
        retrieval_context=chunks,
    )
    assert_test(
        test_case=test_case,
        metrics=[FaithfulnessMetric(threshold=0.85), AnswerRelevancyMetric(threshold=0.8)],
    )

Run: deepeval test run test_rag_pipeline.py

A/B Test with Stable Assignment

import hashlib
from dataclasses import dataclass
 
@dataclass
class ABTestConfig:
    prompt_a: str
    prompt_b: str
    traffic_split: float = 0.5
 
def get_prompt_for_request(config: ABTestConfig, user_id: str) -> str:
    # Built-in hash() is salted per process; use a stable hash so each
    # user sees the same variant across requests and restarts.
    hash_val = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    if hash_val < config.traffic_split * 100:
        return config.prompt_a
    return config.prompt_b
 
# In handler:
variant = get_prompt_for_request(ab_config, user_id)
response = llm.invoke(variant.format(query=query))
# Log: user_id, variant, response, latency, feedback

Conversational Interview Q&A

1. "How do you evaluate a RAG system end-to-end?"

Weak answer: "We use RAGAS and check if the answers look good. We have some test cases."

Strong answer: "We evaluate at two levels—retriever and generator. For the retriever we track context precision and context recall; for the generator we track faithfulness and answer relevancy. We run these on a golden dataset of ~150 examples covering happy path, edge cases, and known failure modes. At Maersk we use DeepEval with pytest, set thresholds at 0.85 faithfulness and 0.8 relevancy, and block merges that fail. We also sample 10% of production traffic for LLM-as-judge scores and alert on drift."


2. "Explain LLM-as-judge. What are its limitations?"

Weak answer: "You use another LLM to grade. It's scalable but has some bias."

Strong answer: "LLM-as-judge uses a strong model like GPT-4 to score another model's output. It's scalable and handles subjective quality like tone. Limitations: the judge has biases, it's expensive, and there's position bias in pairwise comparison. 2025 research shows forced-choice ratings can perform 31% worse when multiple valid answers exist—prefer multi-label response sets. We mitigate with detailed rubrics, multiple judges, position swapping, and quarterly calibration against human ratings. At Maersk we use it for subjective quality on a sample of traffic, but for factual correctness we prefer rule-based or exact match when ground truth exists."


3. "How do you set up evals on every PR?"

Weak answer: "We run some tests in CI. I think we use pytest."

Strong answer: "We use DeepEval with pytest. A parametrized test iterates over our golden dataset, runs the full RAG pipeline for each input, and asserts on faithfulness and answer relevancy. If any metric falls below threshold, the test fails and the merge is blocked. The dataset is versioned in Git. We run deepeval test run in GitHub Actions; the whole suite takes about 5 minutes. At Maersk this is part of our standard PR checks—no exceptions."


4. "Online vs offline evaluation—when do you need both?"

Weak answer: "Offline is before deploy, online is after. You need both for coverage."

Strong answer: "Offline evals run before deploy on curated data—golden datasets, benchmarks, regression tests. They're deterministic and catch regressions in CI. Online evals run in production—A/B tests, human feedback, monitoring. They catch distribution shift and edge cases you never seeded. At Maersk we use both: offline gates every PR, online monitors thumbs-up rate and LLM-as-judge scores. Production failures get triaged weekly and become new golden examples. That closed loop is how we avoid regressions and keep improving."


5. "Quality dropped 15% after a prompt change. How do you respond?"

Weak answer: "Roll back and fix the prompt. Maybe add more test cases."

Strong answer: "First, roll back immediately to stop the bleed. Then investigate: run the offline eval suite on both versions and identify which test cases regressed. Add those to the golden dataset. Analyze patterns—did we break a specific query type? Fix the prompt (or revert if the 'improvement' was wrong), re-run evals until they pass, then re-deploy with a smaller canary. At Maersk we use canary deployments (5% traffic) so we'd catch this before full rollout. The key is having monitoring to detect and a playbook to respond—rollback, diagnose, fix, verify."


6. "How do you build golden datasets?"

Weak answer: "We manually label some examples. Maybe 50 or so."

Strong answer: "Three sources: production logs where users gave thumbs up and didn't correct; expert annotation for domain-specific cases; synthetic generation from seed examples. At Maersk we started with 20 manual examples, pulled 100 from production, added 30 adversarial from support tickets. We aim for diversity—different formats, languages, edge cases. Two annotators label new examples; we measure agreement and adjudicate. We version the dataset in Git and track lineage for reproducibility. Active learning: we prioritize examples where the system is uncertain. Size: 20–50 for dev, 100–200 for production."


7. "Compare DeepEval, RAGAS, and LangSmith."

Weak answer: "DeepEval does evals, RAGAS is for RAG, LangSmith is for LangChain. They're all different."

Strong answer: "DeepEval is the comprehensive option—50+ metrics, pytest integration, runs locally. Best for CI/CD and assertion-style evals. RAGAS pioneered reference-free RAG evaluation—faithfulness, answer relevancy, context precision/recall. Best for RAG-specific metrics. LangSmith is LangChain's tracing + evaluation platform—datasets, annotation queues, eval runs. Best when you're on LangChain. In practice we combine them: LangSmith for tracing and datasets, RAGAS for RAG metrics, DeepEval for CI assertions. At Maersk we use DeepEval in CI and LangSmith for observability."


From Your Experience (Maersk Prompts)

1. Architecture of online + offline evals

We run offline evals in CI on every PR. Golden dataset of ~150 examples covering customer support, product questions, and known failure modes. Each example goes through our full RAG pipeline; we compute faithfulness, answer relevancy, and context recall with DeepEval. Thresholds: 0.85 faithfulness, 0.8 relevancy—merge blocked if below. Online: 50/50 A/B tests for prompt changes with stable assignment by user ID. We monitor thumbs-up rate, task completion, and LLM-as-judge on 10% of traffic. Production failures triaged weekly; confirmed bugs become new golden examples. Closed loop: production → dataset → development → evals → deploy.

2. Metrics for the email booking agent

Extraction accuracy (dates, times, attendees, title)—golden set of 50 emails, exact/fuzzy match. Booking success rate (did the calendar API accept?). User confirmation rate (strong quality signal). P95 latency under 5 seconds. Hallucination rate via LLM-as-judge: "Does this response contain information not in the original email?" We also judge tone and clarity. All run on every PR and sampled in production.

3. Building golden datasets

Started with 20 manual examples—common email formats (Google, Outlook, plain text) and intents (new meeting, reschedule, cancel). Pulled 100 from production with positive feedback. Filtered for diversity: senders, languages, date formats, edge cases (recurring, timezone ambiguity). Added 30 adversarial from support tickets—ambiguous dates, malformed emails. Annotation: two annotators, compute agreement, adjudicate. Version in Git with lineage (dataset, prompt, model) for each run.


Quick Fire Round

  1. Faithfulness formula? Supported claims / total claims. 0–1.

  2. Answer relevancy (RAGAS)? Reverse questions, cosine similarity to original. High = relevant.

  3. Context precision vs recall? Precision = are top chunks relevant? Recall = did we get everything?

  4. Golden dataset size for production? 100–200 minimum.

  5. LLM-as-judge main limitation? Bias, cost, position bias, rating indeterminacy. Use mean over mode.

  6. EDDOps? Evaluation-Driven Development and Operations—evals as continuous function in LLM lifecycle.

  7. Shadow testing? Run new model alongside old on same requests; compare without serving new to users.

  8. Canary deployment? New version to small % (e.g. 5%), monitor, ramp if metrics hold.

  9. Best use for DeepEval? CI/CD, assertion-style evals, comprehensive metrics.

  10. Best use for RAGAS? RAG-specific metrics, reference-free evaluation.

  11. Best use for LangSmith? LangChain tracing + datasets + evals.

  12. Position bias mitigation? Swap positions in pairwise comparison.

  13. Eval flywheel? Failures → golden dataset → improve → evals → ship → monitor.

  14. Why avoid CoT for judges? Can harm performance up to 6.5%; collapses judgment distribution.

  15. HHEM-2.1-Open? Vectara's small T5 for hallucination detection; efficient prod use without LLM judge.


Key Takeaways

Necessity: Evals are non-negotiable. LLMs are non-deterministic; measure or guess.
Offline + Online: Offline gates PRs; online monitors production and feeds the flywheel.
RAG Metrics: Retriever: context precision, recall. Generator: faithfulness, answer relevancy.
LLM-as-Judge: Use for subjective quality. Mitigate bias, calibrate, prefer mean over mode. Avoid CoT.
Eval-Driven Dev: Define success → golden dataset → implement → evals → iterate → ship when passing.
Frameworks: DeepEval (CI), RAGAS (RAG), LangSmith (tracing + datasets). Often combined.
Golden Datasets: Living docs. Start 20–50, grow from failures, version with code.

Further Reading

  • DeepEval Docs (deepeval.com/docs): Metrics, pytest integration, CI/CD, faithfulness/relevancy formulas.
  • RAGAS Docs (docs.ragas.io): RAG evaluation, collections-based API, metric definitions.
  • LangSmith (smith.langchain.com): Tracing, datasets, annotation queues, eval runs.
  • "Evaluation-Driven Development and Operations of LLM Agents" (arXiv:2411.13768): EDDOps process model and reference architecture.
  • "No Free Labels: Limitations of LLM-as-a-Judge" (arXiv:2503.05061): Judge competence and human grounding.
  • "Validating LLM-as-a-Judge under Rating Indeterminacy" (arXiv:2503.05965): Multi-label vs forced-choice; 31% gap.
  • JudgeBench (arXiv:2410.12784): Benchmark for evaluating LLM judges.
  • Braintrust Eval-Driven Development (braintrust.dev): Evals as specs, regression gates, judge calibration.
  • Confident AI Blog (confident-ai.com/blog): RAG metrics, LLM eval best practices.