AI Engineer Prep

Session 12: Prompt Management & Dataset Management

Your best prompt is sitting in someone's Slack message from three months ago. No one can find it. The version in production? Nobody knows who changed it or why. And when something breaks at 2am, you're left staring at a wall of text, wondering if the problem is the model, the context, the temperature, or that one word someone swapped in the system instruction last Tuesday.

That's the reality of ad-hoc prompt management. When prompts live in notebooks, Google Docs, or Notion, teams reportedly lose 30–40% of their prompt engineering time recreating work or debugging without proper tracking. Broken production systems, lost iterations, compliance headaches. And untested prompt changes have caused expensive production incidents; these aren't hypotheticals. The good news? The field has caught up. PromptOps (Prompt Engineering Operations) applies software engineering rigor to prompts. You version them, deploy them in stages, and evaluate before you ship. At Maersk, where we run an Enterprise AI Agent Platform with centralized LLMs, guardrails, evaluations, and prompt/dataset management, this isn't optional. It's how we sleep at night.


1. Why Prompt Management Matters

Interview Insight: Interviewers want to know you treat prompts as production assets, not folklore. Say "version control," "single source of truth," and "rollback without redeploy."

Analogy: Think of prompts like recipes in a restaurant kitchen. If every chef scribbles changes on sticky notes and nobody tracks which recipe is "official," you get inconsistent dishes and no way to roll back when a customer complains. A prompt registry is a proper recipe book: versioned, auditable, and the single source of truth.

Prompts are critical business logic. A small wording change can tank accuracy, spike costs, or break safety guardrails. A production prompt has several components: instructions (the task), context (supporting info), variables (dynamic inputs), and model settings (temperature, max tokens). Manage them as one version-controlled unit. Version chaos happens when the same prompt exists in five places—repos, docs, chat—and no one knows which one is live. Deployment friction happens when prompts are embedded in code: every update means a full deploy. Invisible degradation happens when you "feel" like a change helped—until metrics quietly tank.

Why This Matters in Production: When monitoring surfaces a low-scoring query, you trace it back to the exact prompt version. Rollback = switch the registry label. No code deploy.

Aha Moment: Prompts behave differently from traditional code. Same prompt, different model version? Different behavior. That's why linking versions to outcomes—every trace records which prompt produced which output—is non-negotiable.


2. Prompt Versioning

Interview Insight: "We version prompts like code. Every change creates a new version. We never overwrite. Diffs, rollback, environment separation."

Analogy: Like Git for prompts. You don't edit main in place—you commit. You can see what changed, who changed it, and flip back instantly if something goes wrong.

What to version: template text, variables, model config (temperature, model name), metadata (author, purpose, last tested). Each version gets a unique ID. Loading v2.3.1 always returns the exact same prompt—reproducible, debuggable. Diff visibility is crucial: small wording changes can cause large behavior shifts.
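A minimal sketch of what one immutable version record might capture. The class and field names here are illustrative, not from any particular tool; the point is that a frozen record can never be edited in place, only superseded by a new version:

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen = immutable: a version is never overwritten
class PromptVersion:
    name: str
    version: str      # e.g. "2.3.1" -- unique, never reused
    template: str
    variables: tuple[str, ...]
    model: str
    temperature: float
    author: str
    purpose: str

v = PromptVersion(
    name="booking_extractor",
    version="2.3.1",
    template="Extract booking fields from: {{ email }}",
    variables=("email",),
    model="gpt-4o-mini",
    temperature=0.0,
    author="jane",
    purpose="email booking extraction",
)
```

Loading `v2.3.1` always returns this exact record, which is what makes eval runs and production traces reproducible.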

Approaches: Git-based (prompts in repo—simple but couples to code deploys). Database-backed (registry service, API—decouples, runtime retrieval). Dedicated tools: PromptLayer, Braintrust, Langfuse—version control, diffs, rollback, release labels (prod/staging), collaboration. Humanloop shut down in September 2025; PromptLayer offers a comparable registry plus A/B testing.

Environment separation: dev → staging → production. Changes advance only after validation at each stage.

Why This Matters in Production: "Which prompt is live?" should be answerable in one click. Trace → prompt version → rollback in seconds.

Aha Moment: Immutable versions. Never overwrite. That one rule eliminates 80% of "what changed?" debugging.


3. Prompt Registries

Interview Insight: "We use a centralized registry. Agents fetch prompts at runtime. Content teams can deploy updates without touching code."

Analogy: Like a CDN for prompts. You don't bake images into your app—you fetch them. Same idea. Prompts live in a registry; agents pull them by name and environment.

A prompt registry is centralized storage—single source of truth. Schema: name, version, template, variables, model config, description, tags, performance metrics, last updated. Fetch by name, version, or release label: promptlayer_client.templates.get("my_template"). Agents fetch at runtime; no hardcoding.

Benefits: Staged deployment (prompts move through dev→staging→prod with quality gates). Progressive rollout (feature flags, canary, A/B). Content teams and domain experts can update prompts with the right permissions—no engineer required.
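The core mechanics fit in a toy sketch, assuming an in-memory store (real registries like PromptLayer or Langfuse add persistence, auth, and audit trails): versions are append-only, and release labels are movable pointers, which is why rollback needs no redeploy.

```python
class PromptRegistry:
    """Toy registry: versions are append-only, labels are movable pointers."""
    def __init__(self):
        self._versions = {}  # (name, version) -> template
        self._labels = {}    # (name, label)   -> version

    def publish(self, name: str, version: str, template: str) -> None:
        key = (name, version)
        if key in self._versions:
            raise ValueError("versions are immutable; bump the version instead")
        self._versions[key] = template

    def set_label(self, name: str, label: str, version: str) -> None:
        self._labels[(name, label)] = version  # rollback = repoint, no code deploy

    def get(self, name: str, label: str = "prod") -> tuple[str, str]:
        version = self._labels[(name, label)]
        return version, self._versions[(name, version)]

registry = PromptRegistry()
registry.publish("booking_extractor", "1.0", "Extract fields from: {{ email }}")
registry.publish("booking_extractor", "2.0", "Extract booking fields as JSON from: {{ email }}")
registry.set_label("booking_extractor", "prod", "2.0")
version, template = registry.get("booking_extractor")
# Rollback is one call: registry.set_label("booking_extractor", "prod", "1.0")
```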

Why This Matters in Production: At Maersk, our AI platform serves multiple agents. A shared registry means one place to govern, audit, and roll out changes.

Aha Moment: Decoupling prompts from code means prompt updates don't trigger full deployments. Faster iteration, lower risk.


4. Prompt Templating

Interview Insight: "We use Jinja2 for dynamic prompts. Variables, conditionals, loops. Safe escaping. Composable fragments for shared system instructions."

Analogy: Like HTML templates for web pages. Structure stays; content swaps in. You wouldn't hardcode every possible page—same with prompts.

Jinja2: {{ variable }}, {% if %}, {% for %}. Perfect for dynamic prompts—different instructions by user tier, iterating over retrieved context. Sandboxed rendering and escaping (opt-in for a bare Template) reduce injection risk.

LangChain ChatPromptTemplate: System, human, AI roles. MessagesPlaceholder for chat history. Integrates with chains and agents.

F-strings: Simple but limited—no conditionals/loops, injection risk if careless. Use for trivial cases only.

Composable fragments: Break prompts into reusable parts—system base, task instructions, few-shot examples. Agents share a base (guardrails, tone) while varying task-specific fragments. Supports A/B testing (swap the fragment) and DRY prompts.
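A minimal sketch of fragment composition. The fragment names and text below are hypothetical; the pattern is one shared base plus per-agent task fragments, joined at render time:

```python
# Hypothetical shared base: guardrails and tone, common to every agent.
SYSTEM_BASE = (
    "You are a company AI assistant. Only use provided tools. "
    "If uncertain, say so."
)

# Hypothetical task-specific fragments, one per agent.
FRAGMENTS = {
    "booking": "Task: extract booking fields from the email and return JSON.",
    "tracking": "Task: answer vessel-tracking questions using the schedule tool.",
}

def build_prompt(agent: str, few_shot: str = "") -> str:
    """Compose shared base + task fragment + optional few-shot block."""
    parts = [SYSTEM_BASE, FRAGMENTS[agent]]
    if few_shot:
        parts.append(few_shot)
    return "\n\n".join(parts)

print(build_prompt("booking"))
```

A/B testing a fragment is then just swapping one entry in the dict; the safety base stays fixed.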

Why This Matters in Production: At Maersk, our booking automation prompt varies by document type and customer tier. Templating makes that manageable.

Aha Moment: Composable fragments = shared guardrails across agents + per-agent customization. One safety base, many task variants.


5. A/B Testing Prompts

Interview Insight: "We don't ship on intuition. We A/B test. Deterministic assignment by user_id. Track accuracy, latency, cost per variant. Statistical significance before rollout."

Analogy: Like testing two headlines for an ad. You run both, measure which converts, then ship the winner. Prompts are no different.

Implementation: Assign each request to variant A or B. Deterministic by user_id: hash(user_id) % 2 so the same user always gets the same variant. Track per-variant: task accuracy, quality scores, latency, cost. Tools: Langfuse, Braintrust, PromptCompose—traffic split, metrics dashboards.

Sample size: ~100–200 per variant minimum. For 80% pass rate, 5% margin, 95% confidence: ~246 samples per scenario. Chi-squared, confidence intervals. Multi-armed bandits: adaptive allocation—shift traffic to the winner as data accumulates.
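A two-proportion z-test (equivalent to a 2x2 chi-squared: z² = χ²) is enough to check significance for pass/fail metrics, and it needs no external library. A sketch with illustrative numbers:

```python
import math

def two_proportion_z(successes_a: int, n_a: int,
                     successes_b: int, n_b: int) -> tuple[float, float]:
    """Is variant B's pass rate significantly different from variant A's?"""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the normal CDF, computed via erf (no scipy needed).
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Illustrative: 80% vs 88% pass rate over 250 samples per variant.
z, p = two_proportion_z(200, 250, 220, 250)
print(f"z={z:.2f}, p={p:.4f}")
```

With these made-up counts the difference clears p < 0.05; at smaller samples the same 8-point gap often would not, which is why the sample-size floor matters.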

What to test: prompt text, system instructions, few-shot count, temperature, model choice. When: apps with thousands of daily users. When NOT: healthcare, legal, safety-critical (errors unacceptable); low-volume (<100 daily users).

Why This Matters in Production: Our email booking automation runs at scale. Before we change extraction logic, we validate on traffic. Guesswork = risk.

Aha Moment: A/B testing transforms "I think this is better" into "Variant B beats A by 12% on our metric at p<0.05." That's how you ship confidently.


6. Dataset Management for Evaluation

Interview Insight: "Golden datasets are versioned. We link eval results to dataset version. Every production failure gets added to the golden set. Evaluation flywheel."

Analogy: Like a test suite for code. You don't ship without running tests. Evals are your tests—and your test data needs to be versioned, representative, and growing.

Dataset types: Golden—curated, expert-annotated input/output pairs; ground truth. Production—sampled from real traffic; representative but may need curation. Adversarial—edge cases, prompt injection, tricky inputs. Synthetic—LLM-generated; fast but may lack diversity, introduce bias.

Versioning: Track dataset changes. Link eval runs to dataset version—reproducible. Golden datasets hold inputs and expected answers, NOT pre-computed model outputs; generate candidate outputs at eval time so you can compare different model/prompt versions against the same ground truth.
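One lightweight way to version a dataset, sketched here as an assumption rather than any specific tool's scheme, is a content hash: any change to the examples produces a new version ID, and each eval run records the ID it ran against.

```python
import hashlib
import json

def dataset_version(examples: list[dict]) -> str:
    """Content-addressed version: any change to the examples changes the ID."""
    canonical = json.dumps(
        sorted(examples, key=lambda e: json.dumps(e, sort_keys=True)),
        sort_keys=True,
    )
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

golden_v1 = [
    {"input": "ETA for MV Pacific Star?", "expected": "contains a date or 'unknown'"},
    {"input": "Book 3 reefers to Rotterdam", "expected": "JSON with destination_port"},
]
v1 = dataset_version(golden_v1)

# Adding a production failure to the golden set yields a new version ID.
golden_v2 = golden_v1 + [{"input": "eta??", "expected": "asks for the vessel name"}]
v2 = dataset_version(golden_v2)
```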

Evaluation flywheel: Every production failure → add to golden dataset. Failures become test cases. Over time, the dataset grows and the system improves. Target: 100–500 goldens for production confidence. Start small (20–50), grow continuously. Never overfit—use production-sampled and adversarial data to stress-test. Data contamination (2025): Training and test must not overlap; LLMs may have seen benchmarks.

Why This Matters in Production: Maersk's evaluation pipelines run against versioned golden datasets. CI gates block merges if scores drop. Failures feed the flywheel.

Aha Moment: The golden dataset is a living asset. It evolves with your product, your failures, and your compliance requirements.


7. Synthetic Data Generation

Interview Insight: "We use LLMs to bootstrap eval data. Human review a sample. Second LLM as quality judge. Never rely solely on synthetic when ground truth is critical."

Analogy: Using an LLM to test an LLM sounds circular—and it is, a bit. But like using a calculator to check your math, it works if you add guardrails.

Approaches: Generate QA pairs from source docs (RAG eval), edge cases, adversarial examples. Frameworks: DataGen (ICLR 2025)—attribute-guided generation, code verification, RAG for factual validation.

Model selection (2025 research): GPT-4o excels at generating new problems; Claude-3.5-Sonnet at enhancing existing ones. Data generation ability ≠ problem-solving ability. Pick by data quality metrics, not capability benchmarks.

Quality control: Human review a sample. Heuristics (reject too-short, duplicates). Second LLM as judge. SynQuE (2025): rank synthetic datasets by expected real-world performance with limited unlabeled real data—on text-to-SQL, top-3 selection improved 30.4%→38.4%.
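The heuristic layer is cheap to implement. A sketch of a first-pass gate (the threshold and field names are illustrative; an LLM judge and human spot checks would follow in a real pipeline):

```python
def filter_synthetic(pairs: list[dict], min_len: int = 12) -> list[dict]:
    """First-pass gate for LLM-generated QA pairs: drop too-short and duplicate questions."""
    seen, kept = set(), []
    for pair in pairs:
        question = pair["question"].strip()
        key = question.lower()
        if len(question) < min_len:  # too short to be a meaningful eval case
            continue
        if key in seen:              # exact-duplicate question
            continue
        seen.add(key)
        kept.append(pair)
    return kept

raw = [
    {"question": "What is the ETA for MV Pacific Star at Rotterdam?", "answer": "..."},
    {"question": "What is the ETA for MV Pacific Star at Rotterdam?", "answer": "..."},
    {"question": "ETA?", "answer": "..."},
]
print(len(filter_synthetic(raw)))  # duplicates and fragments are rejected
```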

Risks: Factual inaccuracies, lack of diversity, bias amplification. When to use: bootstrap when you have no labels; increase diversity; generate hard examples. When NOT: ground truth critical—use human annotation.

Why This Matters in Production: At Maersk, we use synthetic data to expand our golden sets for booking extraction. We always validate with human-annotated and production-sampled data.

Aha Moment: Synthetic data is a complement, not a replacement. Use it to scale—then validate with real data.


8. Annotation Workflows

Interview Insight: "Clear guidelines, rubrics, examples. We measure inter-annotator agreement. Active learning to prioritize uncertain examples."

Analogy: Like training new hires. Give them a playbook, examples of good/bad work, and a way to check if they're consistent. Annotation is the same.

Who annotates: Domain experts (highest quality, expensive). Crowd workers (scale, variable quality—needs guidelines + checks). Internal team (good compromise).

Guidelines: Clear instructions, good/bad examples, rubrics per label/score. Ambiguous guidelines = inconsistent labels, low agreement.

Inter-annotator agreement: Cohen's kappa (2 annotators), Fleiss' kappa (multiple). Low agreement = refine rubric or accept subjectivity.
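Cohen's kappa is observed agreement corrected for the agreement expected by chance. A small pure-Python sketch with made-up labels:

```python
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators over the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["good", "good", "bad", "good", "bad", "good"]
b = ["good", "bad", "bad", "good", "bad", "good"]
print(round(cohens_kappa(a, b), 3))
```

Here the annotators agree on 5 of 6 items (0.833 observed) but chance alone predicts 0.5, so kappa lands at about 0.667, substantial but imperfect agreement.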

Tools: Label Studio, Prodigy, LangSmith, Confident AI, custom UIs.

Active learning: Prioritize examples where the model is uncertain or quality is lowest. Maximize value per annotation.

Why This Matters in Production: Our booking automation relies on labeled extraction examples. Consistent annotation = reliable evals.

Aha Moment: Annotation quality is a bottleneck. Invest in guidelines and agreement—it pays off in eval reliability.


9. DSPy — Programmatic Prompt Optimization

Interview Insight: "When manual prompt engineering plateaus, DSPy optimizes prompts from data. Signatures, modules, BootstrapFewShot, MIPROv2. Use when you have a metric and training data."

Analogy: Instead of hand-tuning a guitar by ear, you use a tuner. DSPy is the tuner for prompts—you define the task and metric, it finds good prompts.

Core idea: Prompts as learnable parameters. Define input/output signatures (question -> answer), use modules (dspy.Predict, dspy.ChainOfThought, dspy.ReAct). Provide a metric and small training set (5–10 examples). Optimizers find good few-shot examples, refine instructions, or both.

Optimizers: BootstrapFewShot, BootstrapFewShotWithRandomSearch, COPRO (instructions via coordinate ascent), MIPROv2 (bootstrapping + grounded proposal + Bayesian search for instructions + examples), GEPA (LM reflects on trajectory, proposes improvements), BootstrapFinetune (finetune weights). ~$2, ~10 min per run. Documented gains: ReAct 24%→51% HotPotQA; classification 66%→87% Banking77.

When to use: manual engineering plateaued; clear metric; training data (even small). When NOT: simple tasks; no clear metrics.

Why This Matters in Production: For complex extraction or reasoning tasks, DSPy can find prompts that beat hand-crafted ones. We've used it for agent optimization on the platform.

Aha Moment: DSPy turns "I've tried 50 prompt variants" into "the optimizer found one that scores 15% higher in 10 minutes."


10. Golden Dataset Best Practices

Interview Insight: "Dynamic, decontaminated, diverse, demonstrative of production, defined scope. Start small, grow continuously. Never overfit."

Analogy: The five D's. Your golden set should be a living, clean, broad, realistic, scoped representation of what you ship.

Five principles: Dynamic—evolve with failure modes, user behavior, compliance. Decontaminated—no train/test overlap; check for memorization. Diverse—topics, intents, difficulties, languages, adversarial. Demonstrative of production—curate from real logs, representative scenarios. Defined scope—tailor to your tasks (E2E, tool use, retrieval).

Coverage: Easy (happy path), hard (nuanced), edge (boundary), adversarial. Include negative examples—inputs where the answer is "I don't know" or refusal. Test that the system declines correctly.

Sample size: ~246 per scenario for 80% pass, 5% margin, 95% confidence. Adjust for multi-turn, languages, high-risk.
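That 246 comes from the standard proportion-estimate formula n = z²p(1-p)/m², with p = 0.8, margin m = 0.05, and z = 1.96 for 95% confidence:

```python
import math

def sample_size(p: float, margin: float, z: float = 1.96) -> int:
    """Samples needed to estimate a pass rate p within +/- margin (z=1.96 ~ 95% confidence)."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

print(sample_size(0.80, 0.05))  # -> 246
```

Note the worst case is p = 0.5 (maximum variance), which pushes the requirement to 385 samples per scenario.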

Governance: ISO/IEC 42001 (traceability, risk management), NIST AI RMF (lifecycle practices). Metadata, audit trails, versioning.

Why This Matters in Production: Maersk aligns evals with governance. Our golden datasets support traceability and audit.

Aha Moment: Quality > quantity. 100 well-chosen examples beat 1000 arbitrary ones. Reflect real usage.


Code Examples

Jinja2 Prompt Template (Shipping Logistics)

from jinja2 import Template
 
template = Template("""
You are a {{ role }} assistant.
 
{% if include_examples %}
Here are some examples:
{% for ex in examples %}
- Input: {{ ex.input }}
  Output: {{ ex.output }}
{% endfor %}
{% endif %}
 
User question: {{ question }}
""")
 
rendered = template.render(
    role="shipping logistics",
    include_examples=True,
    examples=[
        {"input": "ETA for vessel X?", "output": "Check the port schedule."},
        {"input": "Booking status?", "output": "Ref 12345: confirmed."},
    ],
    question="What is the ETA for MV Pacific Star?"
)
print(rendered)

A/B Test Variant Assignment

import hashlib

def get_variant(user_id: str, variants: list[str], weights: list[float] | None = None) -> str:
    """Deterministically assign a user to a variant by hashing user_id."""
    weights = weights or [1.0 / len(variants)] * len(variants)
    total = sum(weights)
    # Map the hash to a stable float in [0, total) -- same user, same bucket.
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % (10**6) / 10**6 * total
    cumulative = 0.0
    for variant, weight in zip(variants, weights):
        cumulative += weight
        if bucket < cumulative:
            return variant
    return variants[-1]

DSPy Signature and Optimizer

import dspy
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class QA(dspy.Module):
    def __init__(self):
        super().__init__()  # required when subclassing dspy.Module
        self.respond = dspy.ChainOfThought("question -> answer")

    def forward(self, question):
        return self.respond(question=question)

def metric(example, pred, trace=None):
    # Simple containment check: did the prediction include the expected answer?
    return 1.0 if example.answer.lower() in pred.answer.lower() else 0.0

trainset = [
    dspy.Example(question="What is 2+2?", answer="4").with_inputs("question"),
    dspy.Example(question="Capital of France?", answer="Paris").with_inputs("question"),
]

teleprompter = BootstrapFewShotWithRandomSearch(
    metric=metric, max_bootstrapped_demos=4, max_labeled_demos=4, num_candidate_programs=10
)
optimized = teleprompter.compile(QA(), trainset=trainset)
result = optimized(question="What is the capital of Germany?")
print(result.answer)

Mermaid Diagrams

Prompt Management Architecture

flowchart TB
    subgraph Authors
        productManager[Product Manager]
        engineer[Engineer]
        domainExpert[Domain Expert]
    end
 
    subgraph Registry
        promptRegistry[Prompt Registry]
        versionOne[v1.0]
        versionTwo[v2.0]
        versionThree[v3.0]
        promptRegistry --> versionOne
        promptRegistry --> versionTwo
        promptRegistry --> versionThree
    end
 
    subgraph Evaluation
        evaluationEngine[Evaluation Engine]
        goldenDataset[Golden Dataset]
        evaluationEngine --> goldenDataset
    end
 
    subgraph Deployment
        deploymentController[Deployment Controller]
        production[Production]
        staging[Staging]
        deploymentController --> production
        deploymentController --> staging
    end
 
    subgraph Observability
        productionMonitor[Production Monitor]
        traces[Traces]
        productionMonitor --> traces
    end
 
    productManager --> promptRegistry
    engineer --> promptRegistry
    domainExpert --> promptRegistry
    promptRegistry --> evaluationEngine
    evaluationEngine --> deploymentController
    deploymentController --> production
    production --> traces
    traces --> productionMonitor
    productionMonitor -->|rollback| deploymentController
    productionMonitor -->|low-score samples| goldenDataset

DSPy Optimization Flow

flowchart LR
    subgraph Input
        trainingData[Training Data]
        metricFunction[Metric Function]
        dspyProgram[DSPy Program]
    end
 
    subgraph Optimizer
        bootstrap[Bootstrap / MIPRO / COPRO]
    end
 
    subgraph Output
        optimized[Optimized Program]
        demos[Few-Shot Demos]
        instructions[Refined Instructions]
    end
 
    trainingData --> bootstrap
    metricFunction --> bootstrap
    dspyProgram --> bootstrap
    bootstrap --> optimized
    bootstrap --> demos
    bootstrap --> instructions

Dataset Management Lifecycle

flowchart TB
    subgraph Sources
        prodLogs[Production Logs]
        experts[Expert Annotation]
        synthetic[Synthetic Generation]
    end
 
    subgraph Curation
        curate[Curate and Label]
        iaa[Inter-Annotator Agreement]
        curate --> iaa
    end
 
    subgraph Storage
        golden[Golden Dataset]
        versioning[Dataset Versioning]
        golden --> versioning
    end
 
    subgraph Usage
        evalRun[Evaluation Runs]
        ci[CI/CD Gates]
        evalRun --> ci
    end
 
    subgraph Feedback
        failures[Production Failures]
        expand[Expand Dataset]
        failures --> expand
    end
 
    prodLogs --> curate
    experts --> curate
    synthetic --> curate
    iaa --> golden
    versioning --> evalRun
    evalRun --> failures
    expand --> golden

Evaluation Flywheel

flowchart LR
    productionTraffic[Production Traffic] --> monitor[Monitor]
    monitor --> surfaceFailures[Surface Failures]
    surfaceFailures --> addToGolden[Add to Golden Dataset]
    addToGolden --> goldenDataset[Golden Dataset]
    goldenDataset --> runEvals[Run Evals]
    runEvals --> qualityGate[Quality Gate]
    qualityGate --> deploy[Deploy]
    deploy --> productionTraffic

A/B Test Traffic Flow

flowchart TB
    incomingRequest[Incoming Request] --> hashUserId[Hash user_id]
    hashUserId --> assignVariant[Assign Variant A or B]
    assignVariant --> fetchPromptA[Fetch Prompt A]
    assignVariant --> fetchPromptB[Fetch Prompt B]
    fetchPromptA --> generateResponseA[Generate Response A]
    fetchPromptB --> generateResponseB[Generate Response B]
    generateResponseA --> trackMetrics[Track Metrics per Variant]
    generateResponseB --> trackMetrics
    trackMetrics --> compareSignificance[Compare Statistical Significance]

Conversational Interview Q&A

1. "How do you version prompts in production? What triggers an update?"

Weak answer: "We use Git and store prompts in the repo. When someone changes a prompt, we deploy."

Strong answer: "At Maersk, we treat prompts as versioned artifacts. Every change creates a new version—we never overwrite. The version includes template, variables, model config, and metadata. We use a prompt registry as single source of truth; agents fetch by name and release label (prod, staging). Updates are triggered by product requirements, model changes, quality degradation, or A/B test winners. Before production, we run evals against our golden dataset. Rollback is switching the prod label—no code deploy. Our platform has centralized prompt management, so all agents pull from the same registry with environment separation."


2. "How do you build and maintain golden datasets?"

Weak answer: "We have a CSV with some examples. We add more when we find bugs."

Strong answer: "We start with 20–50 examples and grow continuously. Sources: production logs (curate good and bad), expert annotation, synthetic generation with human review. We cover easy, hard, edge, and adversarial cases—including negative examples where the correct answer is 'I don't know.' We version datasets; each eval run links to a dataset version for reproducibility. The flywheel is critical: every production failure gets added to the golden set. At Maersk, our eval pipelines run in CI; merges block if scores drop. We aim for 100–500 goldens. Quality over quantity—examples that reflect real usage."


3. "When would you use DSPy vs manual prompt engineering?"

Weak answer: "DSPy is for when you don't want to write prompts."

Strong answer: "Use DSPy when manual engineering has plateaued, you have a clear metric, and you have training data—even 5–10 examples. DSPy optimizers like MIPROv2 can find better few-shot examples and instructions automatically; we've seen documented gains like 24%→51% on ReAct. Use manual engineering when the task is simple, prompts work fine, or you need full control for compliance or branding. At Maersk, we've used DSPy for complex extraction and agent optimization where manual iteration wasn't getting us there. It's a tool, not a default—but when it fits, it's powerful."


4. "How do you handle prompts that fail on edge cases?"

Weak answer: "We add more instructions or examples."

Strong answer: "First, we surface edge cases—production monitoring, user feedback. We add them to the golden dataset so they become regression tests. Second, we iterate: explicit instructions, few-shot examples for edge handling, or DSPy if manual iteration plateaus. Third, we consider routing—a separate prompt or model for specific subclasses. Guardrails and output validation catch failures and trigger fallbacks. We define acceptable failure rates and monitor. The key: edge cases are in the eval set. We don't optimize only for the happy path. At Maersk, our booking automation handles diverse email formats; we continuously add failures to our golden set and refine prompts."


5. "How do you scale prompt management across 10+ agents?"

Weak answer: "Each team manages their own prompts."

Strong answer: "Centralized registry—single source of truth. Each agent has prompt names it uses; the registry stores all versions with metadata. Agents fetch at runtime by name and environment. We use composable fragments: a shared 'platform_system_base' for common guardrails and tone; agent-specific fragments for task instructions. This gives consistency plus per-agent customization. We enforce dev→staging→prod with quality gates per agent. Role-based access lets different teams own different agents. We maintain a catalog of agents and their prompt dependencies. For A/B testing, we support per-agent or cross-agent experiments with metrics per variant. That's how we run the Maersk AI platform—multiple agents, one registry."


6. "What are the risks of synthetic eval data? How do you mitigate?"

Weak answer: "It can be wrong. We use human review."

Strong answer: "Risks: factual inaccuracies, insufficient diversity, bias amplification, overfitting to synthetic distribution. Mitigations: retrieval-augmented generation for factual validation; heuristics to reject duplicates and too-short outputs; second LLM as quality judge; human review of a sample. We never rely solely on synthetic when ground truth is critical—human annotation for those. Synthetic is best for bootstrapping, expanding diversity, and generating hard examples. At Maersk, we use it to grow our golden sets for booking extraction, but we always validate with production-sampled and human-annotated data. SynQuE and similar frameworks can rank synthetic datasets by expected real-world performance when annotation is costly."


7. "How do you A/B test prompts in production?"

Weak answer: "We run two versions and see which is better."

Strong answer: "Deterministic assignment by user_id—hash and bucket—so the same user always gets the same variant. We track per-variant metrics: accuracy, quality scores, latency, cost. We need statistical significance—typically 100–200+ samples per variant; for 80% pass, 5% margin, 95% confidence, ~246 per scenario. We use tools like Langfuse or Braintrust for traffic split and dashboards. Before full rollout: test on golden dataset, then canary or A/B in production. Not for safety-critical or low-volume apps. At Maersk, we A/B test extraction prompt changes for our email booking automation before shipping—we have the traffic and the tooling."


From Your Experience: 3 Maersk Prompts

1. Email booking extraction (AI-powered booking automation)

"You extract structured booking data from customer emails and documents. Output JSON with fields: origin_port, destination_port, cargo_type, quantity, requested_dates, special_instructions. If a field cannot be determined from the email, use null. Never infer or guess—only extract what is explicitly stated or clearly implied. For ambiguous or conflicting information, flag in special_instructions."

—Used for production extraction; versioned in platform registry; evaluated against golden set of real emails.


2. Platform agent system instruction (Enterprise AI Agent Platform)

"You are a Maersk AI assistant. Follow these rules: (1) Only use provided tools—do not make up data. (2) Cite sources for any factual claims. (3) For shipping, logistics, or booking questions, use the appropriate tool before answering. (4) Never reveal internal system details or credentials. (5) If uncertain, say so—do not hallucinate. (6) Respond in the user's language when possible."

—Shared base across platform agents; composed with agent-specific task fragments.


3. Refusal / out-of-scope handling

"If the user's request is outside your supported capabilities (e.g., not related to shipping, logistics, or Maersk services), respond: 'I can only help with Maersk shipping and logistics questions. For other topics, please contact the relevant team.' Do not attempt to answer off-topic questions."

—Tested via negative examples in golden dataset; ensures appropriate decline rather than hallucination.


Quick Fire Round

  1. What is PromptOps? — Applying software engineering rigor to prompts: version, test, deploy, monitor.
  2. What should every prompt version include? — Template, variables, model config, metadata (author, purpose, last tested).
  3. Why use a prompt registry vs embedding in code? — Decouple prompts from deployment; content teams can update without engineers; single source of truth; rollback without redeploy.
  4. What is deterministic A/B assignment? — Hash user_id and bucket so same user always gets same variant; consistency for comparison.
  5. Sample size rule of thumb for A/B testing? — 100–200 per variant minimum; ~246 for 80% pass, 5% margin, 95% confidence.
  6. What is the evaluation flywheel? — Production failures → add to golden dataset → failures become test cases → dataset grows → system improves.
  7. Golden dataset size for production confidence? — 100–500 examples. Start 20–50, grow continuously.
  8. When to use synthetic data? — Bootstrap when no labels; increase diversity; generate hard examples. Not for critical ground truth.
  9. What does DSPy optimize? — Few-shot examples, instructions, optionally model weights—from a metric and training data.
  10. MIPROv2 stages? — Bootstrapping (traces), grounded proposal (instructions), discrete search (Bayesian optimization).
  11. Five D's of golden datasets? — Dynamic, Decontaminated, Diverse, Demonstrative of production, Defined scope.
  12. Inter-annotator agreement metrics? — Cohen's kappa (2 annotators), Fleiss' kappa (multiple).
  13. Why not pre-compute outputs in golden datasets? — Generate at eval time so you can compare different model/prompt versions flexibly.
  14. Jinja2 vs f-strings for prompts? — Jinja2: conditionals, loops, safe escaping. F-strings: trivial, single-variable only.
  15. What triggers a prompt rollback? — Monitoring surfaces quality degradation; switch prod label to previous version in registry; no code deploy.

Key Takeaways

  • Prompt management: Treat prompts as production assets—version, test, deploy in stages. PromptOps = rigor.
  • Registry: Single source of truth. Agents fetch at runtime. Decouple from code.
  • Versioning: Immutable versions. Every change = new version. Diff, rollback, env separation.
  • Evaluation: Golden datasets drive quality. Version them. Link eval runs. Flywheel: failures → test cases.
  • A/B testing: Don't ship on intuition. Measure. Deterministic assignment, statistical significance.
  • Synthetic data: Tool for bootstrapping and diversity. Complement with human annotation.
  • DSPy: When manual engineering plateaus—optimize prompts from metric + data.
  • Scale: Composable fragments, shared base, per-agent overrides. One registry, many agents.

Further Reading

  • Braintrust — Prompt management, evaluation, A/B testing. braintrust.dev
  • DSPy docs — Optimizers, signatures, MIPROv2, GEPA. dspy.ai
  • PromptLayer — Prompt registry, versioning, release labels. docs.promptlayer.com
  • Langfuse — Tracing, prompt management, A/B testing. langfuse.com
  • OLMES — Standard for language model evaluations (reproducibility). arXiv:2406.08446
  • Data contamination in LLM evaluation — Detection and mitigation. Recent arXiv (2025)
  • Synthetic data generation with LLMs — Survey. arXiv:2503.14023
  • A Practical Guide for Evaluating LLMs — Dataset principles, decontamination, governance. arXiv:2506.13023
  • SynQuE — Estimating synthetic dataset quality without annotations. arXiv:2511.03928
  • AgoraBench — Evaluating LLMs as synthetic data generators. arXiv:2412.03679
  • ISO/IEC 42001 — AI management systems, traceability, risk management
  • NIST AI Risk Management Framework — Lifecycle-centric practices for trustworthy AI