AI Engineer Prep

Session 17: Fine-Tuning — PEFT, LoRA & Training Pipelines

Everyone wants to fine-tune. Almost nobody needs to. Fine-tuning is the nuclear option of LLM customization—powerful, expensive, and usually overkill. But when you do need it, nothing else comes close. Prompt engineering and RAG solve 90% of enterprise use cases. Fine-tuning is for the 10% where you've exhausted both and the model still doesn't behave the way you need.

At Maersk, we built the Enterprise AI Agent Platform with centralized LLMs, guardrails, and evaluations. Most of our customization lives in prompts, tools, and RAG. But when we hit cases—extraction accuracy that prompt engineering couldn't fix, domain terminology the base model mangled, or latency-sensitive deployments that needed a smaller model—we had to know the fine-tuning playbook. This session covers when to pull that lever, how LoRA and QLoRA make it practical on consumer hardware, and the full pipeline from data prep to deployment. You'll leave knowing exactly when fine-tuning beats prompt engineering and RAG, and how to do it without blowing your budget.


1. When Fine-Tuning Beats Prompt Engineering + RAG

Interview Insight: The interviewer is testing whether you can make trade-off decisions, not whether you know what fine-tuning is. "We fine-tuned because we could" is the wrong answer. You need a decision framework.

Think of it like renovating a house. Prompt engineering is rearranging the furniture—fast, reversible, cheap. RAG is installing a bookshelf and filling it with reference manuals—adds knowledge without changing the house. Fine-tuning is knocking down walls and rewiring—expensive, permanent, and you'd better be sure you need it. You don't knock down walls because you could use more shelf space.

When to fine-tune:

  1. Consistent custom behavior — Tone, format, domain terms the model keeps getting wrong. Example: you need every shipping confirmation email to start "Dear Valued Partner," include a specific disclaimer, and use "TEU" not "twenty-foot equivalent unit." Prompts work sometimes; fine-tuning makes it consistent.

  2. Style or format prompts can't achieve — The model keeps slipping into casual language when you need formal, or vice versa. Few-shot helps but doesn't scale. Fine-tuning bakes the style in.

  3. Latency-sensitive deployments needing smaller models — You need sub-100ms responses, which forces a smaller model like a 7B. Off-the-shelf 7B models don't know your schema, and prompting adds tokens and latency. Fine-tune a small model for your narrow task.

  4. Extraction accuracy — Structured extraction from emails, invoices, or forms where schema adherence and field accuracy matter. If RAG + prompt engineering yields 85% and you need 98%, fine-tuning can close the gap.

When NOT to fine-tune:

  • RAG handles knowledge — New policies, updated rates, carrier info. That's retrieval, not weights.
  • Prompt engineering handles behavior — Most format tweaks, few-shot examples, system prompts. Try these first.
  • You have <100 high-quality examples — Fine-tuning on junk or tiny data makes things worse.
  • The task changes often — Fine-tuning locks you in. Prompts are easy to update.

Why This Matters in Production: At Maersk, our email booking agent used RAG for templates and prompt engineering for extraction format. We only considered fine-tuning when we hit a ceiling on port code accuracy—the model kept confusing similar codes despite retrieval. That's the bar: exhaust simpler options, measure the gap, then decide.

Aha Moment: Fine-tuning is a last resort, not a first step. The best engineers know when not to fine-tune.


2. Full Fine-Tuning vs Parameter-Efficient Methods (PEFT)

Interview Insight: PEFT is the default for most production fine-tuning in 2024–2025. Full fine-tuning is the exception. Interviewers expect you to know both and when to pick each.

Full fine-tuning updates every weight in the model. For a 7B model, that's 7 billion parameters. You need lots of data (tens of thousands of examples), lots of GPU memory (multiple A100s for 7B at FP16), and you risk catastrophic forgetting—the model loses general capabilities and overfits to your narrow task. Like rewriting an entire encyclopedia because you want to add one paragraph.

PEFT freezes most of the base model and trains only small adapter layers. You're adding sticky notes, not rewriting the book. Benefits: far fewer trainable parameters, less data needed, lower memory, base knowledge preserved. Trade-off: some tasks need full model capacity; PEFT can hit ceilings.

Types of PEFT:

Method Idea When to Use
LoRA Low-rank decomposition of weight updates Default. Best balance of quality and efficiency.
Prefix Tuning Prepend learnable prompt vectors When you want trainable, prompt-like behavior.
Prompt Tuning Soft prompt embeddings only Cheaper than LoRA, often lower quality.
Adapters Small bottleneck layers (Houlsby, etc.) Older approach; LoRA usually wins now.

Why This Matters in Production: For enterprise workloads, LoRA is the default. Full fine-tuning is for research or when you have massive proprietary datasets and need every last percent of accuracy.

Aha Moment: PEFT isn't a compromise—it's often better than full fine-tuning because it preserves the base model and reduces overfitting.


3. LoRA Mechanics

Interview Insight: You should be able to draw the LoRA math on a whiteboard. W + BA. Rank r. Why r << d. That's table stakes for a Senior AI Engineer fine-tuning discussion.

Instead of updating the full weight matrix W (d×k), LoRA decomposes the update as ΔW = BA, where B is d×r and A is r×k, with r << d. In the forward pass: output = (W + BA)x = Wx + B(Ax). You freeze W and train only A and B. For a 4096×4096 layer, that's ~16.8M frozen parameters versus ~65K trainable ones at r=8 (B is 4096×8, A is 8×4096). Huge savings.
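The parameter counts and the merge identity are easy to check numerically. A minimal NumPy sketch of the 4096×4096, r=8 case (illustrative initialization, not a training recipe):

```python
import numpy as np

d, k, r = 4096, 4096, 8
rng = np.random.default_rng(0)

W = rng.standard_normal((d, k)) * 0.01   # frozen base weight
A = rng.standard_normal((r, k)) * 0.01   # LoRA "down" projection
B = rng.standard_normal((d, r)) * 0.01   # stand-in for a trained "up" projection

full_params = W.size                      # what full fine-tuning would update
lora_params = A.size + B.size             # what LoRA actually trains
print(f"full: {full_params:,}  lora: {lora_params:,}")  # full: 16,777,216  lora: 65,536

# Adapter path and merged path are mathematically identical
x = rng.standard_normal(k)
out_adapter = W @ x + B @ (A @ x)         # Wx + B(Ax), as at training time
out_merged = (W + B @ A) @ x              # W' = W + BA, as after merging
assert np.allclose(out_adapter, out_merged)
```

In real LoRA, B initializes to zero so training starts exactly at the base model; B is random here only to make the equivalence check non-trivial.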

Rank selection: r=4 to r=64 typical. r=8 or r=16 for most tasks. Higher rank = more capacity, more parameters, more overfitting risk. Start low; increase if underfitting.

Target modules: Usually attention layers—q_proj, v_proj, k_proj, o_proj. Some configs add mlp (feed-forward). Targeting only q_proj and v_proj is a common minimal setup. All linear layers = max capacity, max params.

Merge after training: LoRA adapters add inference overhead (extra matmuls). You can merge the adapters into the base weights: W_new = W + BA. Then inference is identical to the base model—zero overhead. Do this before deployment.

Why This Matters in Production: LoRA lets you train a 7B model on a single A100 or even an RTX 4090. Merge adapters for production to avoid latency bloat.

Aha Moment: LoRA works because of the low-rank hypothesis: weight updates during adaptation lie in a low-dimensional subspace. We're not losing much by compressing them.


4. QLoRA

Interview Insight: QLoRA is how you fine-tune on consumer hardware. 7B model on 24GB? QLoRA. Know the ingredients: 4-bit quant base, fp16 adapters, double quantization, paged optimizers.

QLoRA = Quantized LoRA. Quantize the base model to 4-bit (NF4—NormalFloat4, designed for neural network weights). Apply LoRA adapters in fp16/bf16. The base model stays tiny in memory; only the adapters need full precision. Result: train 7B on a single 24GB GPU (e.g., RTX 4090, A10, or consumer cards).

Double quantization: Quantize the quantization constants (scaling factors) to save another few hundred MB. Usually worth it.

Paged optimizers: Offload optimizer states to CPU when GPU memory spikes (e.g., during gradient accumulation). Prevents OOM. Slows training slightly but unlocks training on memory-constrained machines.

Caveats: 4-bit base = some quality loss. For most adaptation tasks, it's negligible. If you need every last bit of fidelity, use regular LoRA with FP16 base (more VRAM).
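The "7B on a 24GB card" claim is just arithmetic on bytes per parameter. A back-of-envelope sketch (the 40M adapter-parameter figure is an illustrative assumption; activations and framework overhead are ignored):

```python
def footprint_gb(params_billions: float, bytes_per_param: float) -> float:
    """Weight-memory footprint in GB: 1e9 params * bytes / 1e9 bytes-per-GB."""
    return params_billions * bytes_per_param

fp16_base = footprint_gb(7.0, 2.0)   # fp16 base weights: ~14 GB
nf4_base = footprint_gb(7.0, 0.5)    # 4-bit NF4 base weights: ~3.5 GB

# LoRA adapters and their AdamW optimizer states cover only trainable params.
# 0.04B (~40M) trainable params is an illustrative adapter size.
adapters = footprint_gb(0.04, 2.0)       # fp16 adapter weights
adam_states = footprint_gb(0.04, 8.0)    # two fp32 moments per trainable param

print(f"fp16 base: {fp16_base:.1f} GB, nf4 base: {nf4_base:.1f} GB, "
      f"trainable-side: {adapters + adam_states:.2f} GB")
```

Activations, gradient buffers, and CUDA context add several GB on top, which is why 24GB rather than 8GB is the practical floor for a 7B QLoRA run.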

Why This Matters in Production: This is how startups and enterprises without A100 clusters fine-tune. At Maersk, we could prototype extraction models on a single GPU before scaling to training clusters.

Aha Moment: QLoRA democratized fine-tuning. You don't need a cloud GPU budget to experiment.


5. Hugging Face Ecosystem

Interview Insight: HF is the de facto standard. Transformers, PEFT, TRL, Datasets, Accelerate—you should know what each does and how they fit together.

Library Role
Transformers Model loading, tokenizers, pipelines
PEFT LoRA config, get_peft_model, adapter management
TRL SFTTrainer, DPOTrainer, RLHF pipelines
Datasets Load, format, split, streaming
Accelerate Multi-GPU, mixed precision, gradient accumulation

Typical flow: Load model with transformers.AutoModelForCausalLM, wrap with peft.get_peft_model and LoraConfig, train with trl.SFTTrainer, evaluate, merge with model.merge_and_unload(), save.

flowchart LR
    subgraph Load[Load & Configure]
        base[Base Model]
        lora[LoraConfig]
        peft[get_peft_model]
        base --> lora --> peft
    end
    subgraph Train[Train]
        data[Dataset]
        sft[SFTTrainer]
        peft --> sft
        data --> sft
    end
    subgraph Deploy[Deploy]
        merge[merge_and_unload]
        save[Save merged model]
        sft --> merge --> save
    end

Why This Matters in Production: This stack is what you'll use in 95% of fine-tuning projects. Alternatives (Axolotl, LLaMA-Factory) build on the same primitives.

Aha Moment: SFTTrainer handles chat formatting, packing, and logging. Don't roll your own training loop unless you have a reason.


6. Data Requirements

Interview Insight: "How much data do you need?"—expect this. The answer is always "it depends," but you need concrete ranges and the importance of quality.

How much data:

  • LoRA: Hundreds to low thousands of high-quality examples. 500–2000 is typical for narrow tasks. 5000+ for broader behavior.
  • Full fine-tuning: Tens of thousands. 50K+ for meaningful full adaptation.
  • QLoRA: Same as LoRA; quantization doesn't change data needs.

Data quality > quantity. 500 clean, diverse, correctly formatted examples beat 10,000 messy ones. Garbage in, garbage out. Deduplicate. Remove near-duplicates (embed and cluster; keep one per cluster). Check for label noise, format drift, and bias.
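As a simplified stand-in for the embed-and-cluster approach, near-duplicate removal can be sketched with character shingles and Jaccard similarity (the n-gram size and threshold here are illustrative):

```python
def shingles(text: str, n: int = 5) -> set:
    """Character n-gram shingles of a whitespace-normalized, lowercased string."""
    t = " ".join(text.lower().split())
    return {t[i:i + n] for i in range(max(1, len(t) - n + 1))}

def dedup(examples: list, threshold: float = 0.8) -> list:
    """Greedy near-duplicate removal: keep an example only if its Jaccard
    similarity to every already-kept example is below the threshold."""
    kept, kept_shingles = [], []
    for ex in examples:
        s = shingles(ex)
        if all(len(s & k) / len(s | k) < threshold for k in kept_shingles):
            kept.append(ex)
            kept_shingles.append(s)
    return kept

data = [
    "Book 2 TEU from Rotterdam to Singapore, ETA 2024-03-20.",
    "Book 2 TEU from Rotterdam to Singapore, ETA 2024-03-21.",  # near-duplicate
    "Cancel booking MAEU123 and refund the deposit.",
]
print(dedup(data))  # drops the near-duplicate second example
```

An embedding-based version would swap `shingles` for sentence embeddings and Jaccard for cosine similarity, but the greedy keep/drop loop stays the same.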

Formats:

  • Instruction/response pairs: {"instruction": "...", "response": "..."} — most common for SFT.
  • Chat format: [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}] — for multi-turn or chat models.
  • Completion format: Raw text continuations for causal LM training (less common for instruction models).

Why This Matters in Production: At Maersk, extraction fine-tuning would use (email snippet, structured JSON) pairs. We'd need hundreds of real or synthetic examples with correct schema. Quality mattered more than volume.

Aha Moment: Spend 80% of your time on data. The training loop is a commodity.


7. Training Pipeline Design

Interview Insight: Interviewers want to see you think in pipelines, not one-off scripts. Data → format → split → config → train → eval → merge → deploy.

Pipeline stages:

  1. Data prep — Load, clean, deduplicate, validate format.
  2. Format conversion — Convert to chat or instruction format expected by the tokenizer (e.g., ChatML, Llama chat template).
  3. Train/val split — 90/10 or 95/5. Stratify if you have categories.
  4. Configure LoRA — Rank, alpha, target modules, dropout.
  5. Train with SFTTrainer — Learning rate, epochs, batch size, gradient accumulation.
  6. Evaluate — On held-out set and (if possible) general benchmarks to check forgetting.
  7. Merge adapters — merge_and_unload() for zero-overhead inference.
  8. Deploy — Export to vLLM, Triton, or API server.

Key hyperparameters:

Param Typical range Notes
Learning rate 2e-5 to 2e-4 LoRA often tolerates 1e-4–2e-4; full FT stays closer to 2e-5. Higher = faster, risk of instability.
Epochs 1–5 2–3 common. Watch for overfitting.
Batch size 2–8 per device Use gradient accumulation for effective batch.
Gradient accumulation 4–16 Effective batch = per_device × accumulation × devices.
Warmup 3–10% Stabilizes early training.
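The gradient-accumulation row is worth being able to sketch from scratch: accumulate microbatch gradients and take one optimizer step per `accum_steps` microbatches. A toy scalar version (no framework assumed; a trailing partial accumulation is simply dropped in this sketch):

```python
def train_steps(microbatch_grads, accum_steps=4, lr=0.1, w=0.0):
    """Apply one optimizer step per `accum_steps` microbatches,
    using the mean of the accumulated gradients."""
    acc, seen = 0.0, 0
    for g in microbatch_grads:
        acc += g
        seen += 1
        if seen == accum_steps:
            w -= lr * (acc / accum_steps)  # one step on the averaged gradient
            acc, seen = 0.0, 0
    return w

# 8 microbatches of gradient 1.0 with accum=4 -> 2 optimizer steps of lr * 1.0
print(train_steps([1.0] * 8))  # -0.2
```

This is why effective batch = per_device × accumulation × devices: the optimizer only ever sees the averaged gradient of the whole effective batch.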

Why This Matters in Production: A reproducible pipeline beats ad-hoc experimentation. Version your data, config, and checkpoints. Use MLflow or similar to track runs.

Aha Moment: Early stopping on validation loss prevents overfitting. Monitor both task accuracy and general capability (e.g., MMLU sample) to catch forgetting.


8. Catastrophic Forgetting

Interview Insight: "What is catastrophic forgetting?" — Standard question. You need the definition, why it happens, and mitigations.

The model loses general capabilities after fine-tuning on narrow data. It was great at reasoning, coding, and general knowledge; after fine-tuning on shipping confirmations it forgets basic math, or it follows your format but can't generalize. Like training a doctor to specialize in cardiology and having them forget how to take blood pressure.

Why: Gradient updates optimize for your task. General knowledge gets overwritten.

Mitigations:

  1. LoRA (or PEFT) — Base weights stay frozen. Only adapters change. Major protection.
  2. Mix general data — Add 5–20% general instruction data (Alpaca, ShareGPT) to the mix. Anchors general capabilities.
  3. Early stopping — Stop before overfitting. More epochs = more forgetting.
  4. Evaluation — Run general benchmarks (MMLU, HumanEval sample) alongside task eval. If general drops, you've forgotten.
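Mitigation 2 is plain dataset composition. A minimal sketch of blending task data with a target fraction of general instruction data (the 15% figure and the toy examples are illustrative; with Hugging Face `datasets` you would reach for `interleave_datasets` instead):

```python
import random

def mix(task_data: list, general_data: list, general_frac: float = 0.15,
        seed: int = 0) -> list:
    """Blend task examples with general instruction data so that roughly
    `general_frac` of the final mix is general (capped by availability)."""
    rng = random.Random(seed)
    n_general = min(len(general_data),
                    round(len(task_data) * general_frac / (1 - general_frac)))
    mixed = task_data + rng.sample(general_data, n_general)
    rng.shuffle(mixed)
    return mixed

# Toy stand-ins: 850 task examples, a larger pool of general instruction data
task = [{"instruction": f"extract booking {i}", "response": "{}"} for i in range(850)]
general = [{"instruction": f"general q {i}", "response": "a"} for i in range(5000)]

mixed = mix(task, general, general_frac=0.15)
print(len(mixed))  # 1000 examples total, 150 of them general
```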

Why This Matters in Production: Fine-tuning for extraction shouldn't kill the model's ability to handle edge cases or reason. Monitor both task and general metrics.

Aha Moment: LoRA is your first line of defense. It's not just about efficiency—it's about preserving the base model.


9. SFT vs RLHF vs DPO

Interview Insight: SFT is what you'll do 90% of the time. RLHF and DPO are for alignment and preference learning. Know the difference and when each applies.

Method What it does When to use
SFT Supervised Fine-Tuning. Train on (input, desired output) pairs. Standard instruction tuning. Default. Most adaptation tasks.
RLHF Reinforcement Learning from Human Feedback. Train a reward model on preferences, then optimize policy (model) with PPO. Alignment. Used by OpenAI, Anthropic for Chat models. Expensive, complex.
DPO Direct Preference Optimization. Uses preference pairs (chosen vs rejected) directly—no separate reward model. Simpler alternative to RLHF. When you have preference data and want better alignment.

SFT — You have correct answers. Train the model to produce them. Extraction, classification, format adherence, domain terminology. This is the bread and butter.

RLHF — You have preferences (A is better than B) but not necessarily "correct" answers. Train a reward model to score outputs, then use PPO to maximize reward. Powerful but needs lots of preference data, reward model training, and PPO tuning. Rarely worth it for enterprise unless you're building a ChatGPT competitor.

DPO — You have (prompt, good_response, bad_response) triples. DPO trains the policy directly to prefer good over bad. No reward model. Simpler than RLHF, often similar results. Use when you have preference data (e.g., human ratings of model outputs) and want to improve alignment.
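The DPO objective is compact enough to write out directly. A sketch of the per-example loss using scalar sequence log-probabilities (in a real run, TRL's DPOTrainer computes these from the policy and a frozen reference model):

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss: -log sigmoid(beta * (policy margin - reference margin)).

    The policy is rewarded for widening its chosen-vs-rejected log-prob gap
    beyond the reference model's gap; beta controls how hard it is pushed.
    """
    logits = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# Policy already prefers chosen more than the reference does -> low loss
low = dpo_loss(-10.0, -20.0, -12.0, -14.0)
# Policy prefers the rejected answer -> high loss
high = dpo_loss(-20.0, -10.0, -12.0, -14.0)
print(f"{low:.3f} < {high:.3f}")
```

Note the reference model enters only through the two margins: no reward model is ever trained, which is the whole pitch versus RLHF.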

Why This Matters in Production: For Maersk-style extraction and booking, SFT is sufficient. DPO could help if we had human feedback on "this extraction is better than that one" and wanted to optimize for that. RLHF is overkill.

Aha Moment: Start with SFT. Add DPO only if you have preference data and need alignment beyond what SFT gives.


10. Cost and Infrastructure

Interview Insight: "How much does fine-tuning cost?" — Have numbers. LoRA on 7B: single A100 or RTX 4090. Full FT: multi-GPU. Cloud options matter.

GPU sizing:

Setup Model size Hardware
LoRA / QLoRA 7B Single A100 (40GB), RTX 4090 (24GB), A10
LoRA / QLoRA 13B A100 40GB or 80GB
LoRA 70B 2–4x A100 80GB
Full FT 7B 2–4x A100
Full FT 70B 8+ A100 80GB

Cloud options: Lambda Labs, RunPod, vast.ai (cheap spot GPUs), AWS SageMaker, GCP Vertex AI. Lambda and RunPod often cheaper for short bursts. SageMaker for enterprise integration.

Training time (rough): LoRA 7B, 1K examples, 3 epochs—~30 min to 2 hours on A100. Full FT—hours to a day.

Cost comparison vs API: Fine-tuning has upfront cost (GPU hours) but then inference is cheap if self-hosted. API: pay per token forever. Rule of thumb: if you're doing millions of inferences on the same task, fine-tuning can pay off. If volume is low or task shifts often, API wins.
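The "millions of inferences" rule of thumb falls out of a one-line break-even calculation. All prices below are illustrative assumptions, not quotes:

```python
def breakeven_requests(train_cost: float, selfhost_per_req: float,
                       api_per_req: float) -> float:
    """Requests at which the one-off training cost is recouped by per-request savings."""
    if api_per_req <= selfhost_per_req:
        return float("inf")  # the API is already cheaper per request
    return train_cost / (api_per_req - selfhost_per_req)

# Illustrative numbers: $50 of GPU time to train, $0.0001/request amortized
# self-hosted serving, $0.002/request for a hosted API on the same task.
n = breakeven_requests(train_cost=50.0, selfhost_per_req=0.0001, api_per_req=0.002)
print(f"break-even after ~{n:,.0f} requests")
```

Below that volume, or when the task shifts often enough to force retraining, the API side of the ledger wins.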

Why This Matters in Production: Budget before you start. A $50 RunPod A100 for a few hours is fine for experimentation. Production training might need reserved instances and reproducibility.

Aha Moment: QLoRA on a 24GB card = fine-tune 7B for under $20 in cloud compute. That changes the calculus for prototyping.


11. Practical Examples

Interview Insight: Be ready to give concrete examples. Extraction, classification, domain terms, format, code. These map to real interview questions.

Use case Why fine-tune What you gain
Extraction accuracy Schema and field accuracy matter. Prompts get 85%, you need 98%. Consistent JSON structure, fewer field errors, better handling of edge formats.
Classification Custom labels, nuanced categories. Off-the-shelf models don't know your taxonomy. Higher precision/recall on your labels.
Domain terminology Model keeps saying "container" when you need "TEU" or "FCL." Correct jargon, fewer corrections.
Response format Email template, disclaimer, specific structure. Prompts drift. Format baked in. No prompt tokens for format.
Code generation on internal APIs Model doesn't know your SDK, conventions, or internal endpoints. Correct imports, API calls, patterns.

For Maersk email booking: extraction (carrier, port, date, rate from emails), format (booking confirmation template), domain terms (TEU, FCL, B/L, etc.). Fine-tuning could target all three if prompt engineering hit a ceiling.

Why This Matters in Production: Each example has a clear metric. Extraction: field-level accuracy. Format: template adherence. Terms: term error rate. Measure before and after.

Aha Moment: Fine-tune for one thing per run when possible. Multi-objective fine-tuning is harder to tune and evaluate.


12. OpenAI / Anthropic Fine-Tuning APIs

Interview Insight: Provider fine-tuning vs self-hosted is a trade-off. Know when to use each.

OpenAI fine-tuning — GPT-4o, GPT-4o-mini, etc. Upload JSONL, they train. You get a fine-tuned model ID, use it like base model. Pay per token (usually slightly higher than base). No GPU management. Limited control over hyperparameters and architecture.
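The upload format is chat-style JSONL, one training example per line. A sketch of preparing the file (the commented-out client calls need an API key, and model names go stale, so treat them as an illustration of the `openai` client shape, not gospel):

```python
import json
import os
import tempfile

# One training example per JSONL line, in the chat format the API expects.
examples = [
    {"messages": [
        {"role": "system", "content": "Extract booking fields as JSON."},
        {"role": "user", "content": "Ship 2 TEU via Maersk to Rotterdam, ETA 2024-03-20."},
        {"role": "assistant", "content": '{"carrier": "Maersk", "port": "Rotterdam", "teu": 2}'},
    ]},
]

path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
with open(path, "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# With the openai client (requires OPENAI_API_KEY; check current docs for models):
# from openai import OpenAI
# client = OpenAI()
# upload = client.files.create(file=open(path, "rb"), purpose="fine-tune")
# job = client.fine_tuning.jobs.create(training_file=upload.id,
#                                      model="gpt-4o-mini-2024-07-18")

print(path)
```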

Anthropic Claude fine-tuning — Similar. Upload data, they train. Limited model availability (check current offerings). Same trade-off: convenience vs control.

When to use provider fine-tuning:

  • You want zero infra. No GPU, no training pipeline.
  • Your data can be sent to the provider (privacy/compliance OK).
  • You're fine with their pricing and model versioning.
  • You don't need custom architectures (e.g., specific LoRA config).

When to use self-hosted:

  • Data must stay on-prem or in your cloud.
  • You need full control: LoRA rank, target modules, custom data formats.
  • Volume is high enough that inference cost savings from self-hosting matter.
  • You're fine-tuning open-source (Llama, Mistral, etc.) that providers don't offer.

Why This Matters in Production: Maersk likely has data residency constraints—sending customer emails to OpenAI for fine-tuning may be a no-go. Self-hosted with QLoRA on internal infra is the default for sensitive data.

Aha Moment: Provider fine-tuning is a fast path. Self-hosted is the flexible path. Pick based on data, compliance, and control needs.


Mermaid Diagrams

Fine-Tuning Decision Flow

flowchart TD
    need[Need Customization]
    pe[Try Prompt Engineering]
    peOk{Good Enough?}
    rag[Add RAG]
    ragOk{Good Enough?}
    ft[Fine-Tune]
    need --> pe
    pe --> peOk
    peOk -->|Yes| done[Ship It]
    peOk -->|No| rag
    rag --> ragOk
    ragOk -->|Yes| done
    ragOk -->|No| ft
    ft --> done

LoRA Weight Update

flowchart LR
    subgraph Full["Full Update (expensive)"]
        W[W d x k]
        deltaW["Delta W d x k"]
        Wnew["W' = W + Delta W"]
        W --> deltaW --> Wnew
    end
 
    subgraph LoRA["LoRA (efficient)"]
        W2[W frozen]
        B["B d x r"]
        A["A r x k"]
        BA["BA = Delta W"]
        W2 --> B
        B --> A --> BA
    end

Training Pipeline

flowchart TB
    data[Data Prep]
    format[Format Conversion]
    split[Train/Val Split]
    config[LoRA Config]
    train[SFTTrainer]
    eval[Evaluate]
    merge[Merge Adapters]
    deploy[Deploy]
 
    data --> format --> split --> config --> train --> eval --> merge --> deploy

SFT vs DPO vs RLHF

flowchart TB
    subgraph sft[SFT]
        sftIn["(input, output) pairs"]
        sftTrain[Train on correct answers]
        sftIn --> sftTrain
    end
 
    subgraph dpo[DPO]
        dpoIn["(prompt, chosen, rejected)"]
        dpoTrain[Direct preference optimization]
        dpoIn --> dpoTrain
    end
 
    subgraph rlhf[RLHF]
        rm[Train reward model]
        ppo[PPO policy optimization]
        rm --> ppo
    end

QLoRA Memory Layout

flowchart TB
    subgraph gpu[GPU Memory]
        base4["Base Model 4-bit"]
        adapters["LoRA Adapters fp16"]
        opt["Optimizer States"]
    end
    subgraph cpu[CPU optional]
        paged["Paged Optimizer"]
    end
    base4 --> adapters --> opt
    opt -.->|"if OOM"| paged

Code Examples

LoRA Config with PEFT

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
 
model_name = "meta-llama/Llama-3.2-3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
 
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
 
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# prints trainable vs. total parameter counts — typically well under 1% trainable

SFTTrainer Setup

from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
 
dataset = load_dataset("json", data_files="train.jsonl", split="train")
 
# In recent TRL, dataset options (text field, max length, packing) live in SFTConfig
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="./finetuned",
        dataset_text_field="text",
        max_seq_length=2048,
        dataset_num_proc=4,
        packing=True,
        num_train_epochs=3,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=2e-5,
        warmup_ratio=0.03,
        logging_steps=10,
        save_strategy="epoch",
        fp16=True,
    ),
)
 
trainer.train()
trainer.save_model("./finetuned")

Chat Format Data

def format_chat(example):
    messages = [
        {"role": "user", "content": example["instruction"]},
        {"role": "assistant", "content": example["response"]},
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=False,
    )
    return {"text": text}
 
dataset = dataset.map(format_chat, remove_columns=dataset.column_names)
# Use dataset_text_field="text" in SFTTrainer

Inference with Merged Model

# `model` is the trained PeftModel returned by get_peft_model / SFTTrainer

# Option 1: Merge adapters for zero-overhead inference
model = model.merge_and_unload()
model.save_pretrained("./merged_model")
tokenizer.save_pretrained("./merged_model")

# Option 2: Load the merged model and run inference
from transformers import pipeline

pipe = pipeline("text-generation", model="./merged_model", tokenizer=tokenizer)
output = pipe(
    "Extract carrier, port, and date from: Ship via Maersk from LAX to Rotterdam, ETA 2024-03-20",
    max_new_tokens=128,
    do_sample=False,  # greedy decoding; temperature=0 is rejected by transformers
)
print(output[0]["generated_text"])

QLoRA with BitsAndBytes

from transformers import BitsAndBytesConfig
 
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",
    bnb_4bit_use_double_quant=True,
)
 
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
model = get_peft_model(model, lora_config)
# Now train with SFTTrainer - fits on 24GB GPU

Conversational Interview Q&A

1. "When would you fine-tune instead of using RAG or prompt engineering?"

Weak answer: "When we need better accuracy or custom behavior." (Too vague. Doesn't give a decision framework.)

Strong answer: "I'd exhaust prompt engineering and RAG first. Fine-tune when: (1) we need consistent format or tone that prompts can't achieve—e.g., every shipping confirmation must start with a specific disclaimer. (2) extraction accuracy hits a ceiling—we're at 85% with RAG+prompts and need 98%. (3) latency-sensitive deployment needing a smaller model that behaves like our task—fine-tune 7B for extraction so we don't need the 70B API. At Maersk, our email booking agent used RAG for templates. We'd only consider fine-tuning if we hit a ceiling on port code or carrier extraction accuracy that we couldn't fix with better prompts or retrieval."

2. "Explain LoRA. Why does it work?"

Weak answer: "LoRA trains small adapter layers instead of the full model. It's parameter-efficient." (Missing the math and the intuition.)

Strong answer: "LoRA decomposes the weight update as ΔW = BA, where B is d×r and A is r×k, r << d. Instead of updating ~16.8M parameters in a 4096×4096 layer, we train about 65K for r=8. The low-rank hypothesis: adaptations during fine-tuning lie in a low-dimensional subspace. We're not losing much. We freeze the base W and only train A and B. After training, we merge: W' = W + BA. Inference is identical to base model—zero overhead. It works because most task-specific updates are compressible."

3. "What's the difference between QLoRA and LoRA?"

Weak answer: "QLoRA uses quantization." (Doesn't explain what or why.)

Strong answer: "QLoRA quantizes the base model to 4-bit (NF4) while keeping LoRA adapters in fp16. The base model uses a fraction of the memory—a 7B model goes from ~14GB to ~4GB. That lets you train on a 24GB consumer GPU. Double quantization squeezes the scaling factors too. Paged optimizers offload optimizer states to CPU on memory spikes. Trade-off: 4-bit base has slight quality loss, but for most adaptation tasks it's negligible. QLoRA democratized fine-tuning—you don't need A100s."

4. "How much data do you need for LoRA?"

Weak answer: "A few thousand examples." (No nuance on quality vs quantity.)

Strong answer: "Hundreds to low thousands for narrow tasks—500 to 2000 high-quality examples is typical. Quality trumps quantity. 500 clean, diverse, correctly formatted examples beat 10K messy ones. For broader behavior or style, 5K+. Deduplicate, remove near-duplicates, check for label noise. For our Maersk extraction use case, we'd need hundreds of (email, correct JSON) pairs. I'd spend most of the effort on data curation—the training loop is straightforward."

5. "What is catastrophic forgetting and how do you mitigate it?"

Weak answer: "The model forgets things. We use LoRA to reduce it." (Incomplete. No other mitigations.)

Strong answer: "Catastrophic forgetting is when fine-tuning on narrow data erodes general capabilities—the model was good at reasoning, now it can't. LoRA is the main mitigation: we freeze the base, only train adapters. Beyond that: mix in 5–20% general instruction data to anchor capabilities, use early stopping to avoid overfitting, and evaluate on general benchmarks (MMLU sample) alongside task metrics. If general drops, we've forgotten. At Maersk, we'd monitor both extraction accuracy and a small set of general questions."

6. "When would you use provider fine-tuning (OpenAI/Anthropic) vs self-hosted?"

Weak answer: "Depends on cost and control." (Missing data residency, compliance, volume.)

Strong answer: "Provider fine-tuning: zero infra, fast iteration, data can leave your environment. Good when you're prototyping or when compliance allows. Self-hosted: when data must stay on-prem or in your cloud—Maersk has data residency requirements, so customer emails likely can't go to OpenAI. Also when you need full control—custom LoRA config, open-source models, specific data formats. And when inference volume is high enough that self-hosting pays off. Rule of thumb: sensitive data or high volume → self-host. Quick experiment, no compliance concerns → provider."

7. "Walk me through a fine-tuning pipeline from data to deployment."

Weak answer: "We prepare data, train with SFTTrainer, and deploy." (No detail.)

Strong answer: "Data prep: load, clean, deduplicate. Format: convert to chat or instruction format the tokenizer expects. Split 90/10. Configure LoRA—r=16, target q_proj/v_proj/k_proj/o_proj. SFTTrainer with 2e-5 LR, 3 epochs, batch size 2, grad accum 8. Evaluate on val set and a sample of general benchmarks. Merge adapters with merge_and_unload for zero-overhead inference. Save. Deploy to vLLM or our API layer. Version data, config, and checkpoint in MLflow. At Maersk we'd add evals for extraction field accuracy before and after to measure lift."


From Your Experience (Maersk Prompts)

When they ask "tell me about a time you..."—you've got these. Tailor to your actual experience.

"Have you fine-tuned models at Maersk? When would you have considered it for the email booking agent?"
Most of our customization is prompt engineering and RAG. We'd consider fine-tuning if extraction accuracy hit a ceiling—e.g., confusion between similar-looking port codes or carrier name variants that prompts couldn't fix. Or if we needed a smaller, faster model for edge deployment with the same extraction behavior. The decision would come after exhausting better prompts, more few-shot examples, and improved retrieval. Fine-tuning is the last resort.

"How would you design a fine-tuning pipeline for the Enterprise AI Agent Platform?"
We'd need a clear use case first—e.g., extraction, classification, or format adherence for a specific agent. Data: curate instruction/response or chat-format pairs, quality over quantity. Use PEFT (LoRA or QLoRA) to preserve base capabilities. Train with SFTTrainer, evaluate on task metrics plus a sample of general benchmarks. Integrate with MLflow for experiment tracking. Merge adapters and deploy via our model serving layer. Guardrails and evaluations would run on fine-tuned models the same as base models. We'd only do this for workloads where prompt engineering had hit a limit.

"What infrastructure would you use for fine-tuning at Maersk—cloud or on-prem?"
Given data residency, on-prem or approved cloud (e.g., our own Azure/AWS) is likely required. We'd use QLoRA for 7B models on 24GB cards—single A10 or A100 per run. For larger models or full fine-tuning, we'd need a multi-GPU cluster. Training would be episodic—not continuous—so spot or reserved instances depending on urgency. We'd track cost per run and compare to API costs to justify when fine-tuning pays off.


Quick Fire Round

  1. When does fine-tuning beat RAG? When you need consistent behavior (tone, format, terms) or extraction accuracy that prompts can't achieve. Not for knowledge—that's RAG.
  2. What is PEFT? Parameter-Efficient Fine-Tuning. Freeze most weights, train small adapters. LoRA, Prefix Tuning, Prompt Tuning.
  3. LoRA formula? ΔW = BA. B d×r, A r×k. r << d. Train A and B only.
  4. Typical LoRA rank? r=4 to r=64. r=8 or r=16 common. Start low.
  5. What modules does LoRA typically target? q_proj, v_proj, k_proj, o_proj (attention). Sometimes mlp.
  6. Merge adapters for what? Zero inference overhead. W' = W + BA, then inference = base model.
  7. QLoRA in one sentence? 4-bit quantize base, LoRA in fp16. Train 7B on 24GB.
  8. Double quantization? Quantize the scaling factors in 4-bit quant. Saves more memory.
  9. Paged optimizers? Offload optimizer states to CPU when GPU OOM. For memory-constrained training.
  10. SFTTrainer from? TRL (Transformer Reinforcement Learning library). Handles SFT, DPO.
  11. Data for LoRA? Hundreds to low thousands. Quality > quantity.
  12. SFT vs DPO vs RLHF? SFT = (input, output) pairs. DPO = (prompt, chosen, rejected). RLHF = reward model + PPO.
  13. Catastrophic forgetting mitigation? LoRA (freeze base), mix general data, early stopping, eval on general benchmarks.
  14. 7B LoRA on what GPU? Single A100 40GB or RTX 4090 24GB (with QLoRA).
  15. Provider vs self-host fine-tuning? Provider = convenience, data leaves. Self-host = control, data residency, open-source.

Key Takeaways (Cheat Sheet)

Topic Key Point
When to fine-tune Exhaust prompts + RAG first. Fine-tune for consistent behavior, extraction accuracy, or smaller latency-sensitive models.
When NOT to RAG handles knowledge. Prompts handle most behavior. <100 examples. Task changes often.
Full vs PEFT Full = all weights, expensive, forgetting risk. PEFT = adapters only. LoRA is default.
LoRA ΔW = BA. r=8–16. Target attention. Merge after training.
QLoRA 4-bit base + fp16 LoRA. 7B on 24GB. Double quant, paged optimizers.
HF stack Transformers, PEFT, TRL, Datasets, Accelerate. SFTTrainer for SFT.
Data 500–2K for LoRA. Quality > quantity. Chat or instruction format. Dedupe.
Pipeline Data prep → format → split → LoRA config → SFTTrainer → eval → merge → deploy.
Hyperparams LR 2e-5, epochs 2–3, batch 2–8, grad accum 4–16.
Catastrophic forgetting LoRA mitigates. Mix general data. Early stop. Eval general benchmarks.
SFT vs DPO vs RLHF SFT = default. DPO = preferences, no reward model. RLHF = alignment, expensive.
Infrastructure LoRA 7B: single A100 or 4090. QLoRA: 24GB. Cloud: Lambda, RunPod, SageMaker.
Provider fine-tuning OpenAI/Anthropic. Convenience vs control. Data residency often blocks.

Further Reading

  • LoRA paper (Hu et al., 2021) — Low-Rank Adaptation of Large Language Models. The foundational paper.
  • QLoRA paper (Dettmers et al., 2023) — Efficient Finetuning of Quantized LLMs. 4-bit + LoRA.
  • PEFT library — Hugging Face PEFT. LoraConfig, get_peft_model.
  • TRL — Transformer Reinforcement Learning. SFTTrainer, DPOTrainer.
  • Axolotl — Alternative training framework. YAML config, popular for Llama fine-tuning.
  • OpenAI Fine-Tuning — API docs. When to use provider vs self-host.
  • Anthropic Fine-Tuning — Claude fine-tuning offerings. Check current availability.
  • DPO paper — Direct Preference Optimization. Simpler than RLHF for preference learning.