Session 17: Fine-Tuning — PEFT, LoRA & Training Pipelines
Everyone wants to fine-tune. Almost nobody needs to. Fine-tuning is the nuclear option of LLM customization—powerful, expensive, and usually overkill. But when you do need it, nothing else comes close. Prompt engineering and RAG solve 90% of enterprise use cases. Fine-tuning is for the 10% where you've exhausted both and the model still doesn't behave the way you need.
At Maersk, we built the Enterprise AI Agent Platform with centralized LLMs, guardrails, and evaluations. Most of our customization lives in prompts, tools, and RAG. But when we hit cases—extraction accuracy that prompt engineering couldn't fix, domain terminology the base model mangled, or latency-sensitive deployments that needed a smaller model—we had to know the fine-tuning playbook. This session covers when to pull that lever, how LoRA and QLoRA make it practical on consumer hardware, and the full pipeline from data prep to deployment. You'll leave knowing exactly when fine-tuning beats prompt engineering and RAG, and how to do it without blowing your budget.
1. When Fine-Tuning Beats Prompt Engineering + RAG
Interview Insight: The interviewer is testing whether you can make trade-off decisions, not whether you know what fine-tuning is. "We fine-tuned because we could" is the wrong answer. You need a decision framework.
Think of it like renovating a house. Prompt engineering is rearranging the furniture—fast, reversible, cheap. RAG is installing a bookshelf and filling it with reference manuals—adds knowledge without changing the house. Fine-tuning is knocking down walls and rewiring—expensive, permanent, and you'd better be sure you need it. You don't knock down walls because you could use more shelf space.
When to fine-tune:
- Consistent custom behavior — Tone, format, domain terms the model keeps getting wrong. Example: you need every shipping confirmation email to start "Dear Valued Partner," include a specific disclaimer, and use "TEU" not "twenty-foot equivalent unit." Prompts work sometimes; fine-tuning makes it consistent.
- Style or format prompts can't achieve — The model keeps slipping into casual language when you need formal, or vice versa. Few-shot helps but doesn't scale. Fine-tuning bakes the style in.
- Latency-sensitive deployments needing smaller models — You need sub-100ms response with a 7B model that behaves like your custom use case. Off-the-shelf 7B models don't know your schema; prompting adds tokens and latency. Fine-tune a small model for your narrow task.
- Extraction accuracy — Structured extraction from emails, invoices, or forms where schema adherence and field accuracy matter. If RAG + prompt engineering yields 85% and you need 98%, fine-tuning can close the gap.
When NOT to fine-tune:
- RAG handles knowledge — New policies, updated rates, carrier info. That's retrieval, not weights.
- Prompt engineering handles behavior — Most format tweaks, few-shot examples, system prompts. Try these first.
- You have <100 high-quality examples — Fine-tuning on junk or tiny data makes things worse.
- The task changes often — Fine-tuning locks you in. Prompts are easy to update.
Why This Matters in Production: At Maersk, our email booking agent used RAG for templates and prompt engineering for extraction format. We only considered fine-tuning when we hit a ceiling on port code accuracy—the model kept confusing similar codes despite retrieval. That's the bar: exhaust simpler options, measure the gap, then decide.
Aha Moment: Fine-tuning is a last resort, not a first step. The best engineers know when not to fine-tune.
2. Full Fine-Tuning vs Parameter-Efficient Methods (PEFT)
Interview Insight: PEFT is the default for most production fine-tuning in 2024–2025. Full fine-tuning is the exception. Interviewers expect you to know both and when to pick each.
Full fine-tuning updates every weight in the model. For a 7B model, that's 7 billion parameters. You need lots of data (tens of thousands of examples), lots of GPU memory (multiple A100s for 7B at FP16), and you risk catastrophic forgetting—the model loses general capabilities and overfits to your narrow task. Like rewriting an entire encyclopedia because you want to add one paragraph.
PEFT freezes most of the base model and trains only small adapter layers. You're adding sticky notes, not rewriting the book. Benefits: far fewer trainable parameters, less data needed, lower memory, base knowledge preserved. Trade-off: some tasks need full model capacity; PEFT can hit ceilings.
Types of PEFT:
| Method | Idea | When to Use |
|---|---|---|
| LoRA | Low-rank decomposition of weight updates | Default. Best balance of quality and efficiency. |
| Prefix Tuning | Prepend learnable prompt vectors | When you want prompt-like behavior, trainable. |
| Prompt Tuning | Soft prompt embeddings only | Cheaper than LoRA, often lower quality. |
| Adapters | Small bottleneck layers (Houlsby, etc.) | Older approach; LoRA usually wins now. |
Why This Matters in Production: For enterprise workloads, LoRA is the default. Full fine-tuning is for research or when you have massive proprietary datasets and need every last percent of accuracy.
Aha Moment: PEFT isn't a compromise—it's often better than full fine-tuning because it preserves the base model and reduces overfitting.
3. LoRA Mechanics
Interview Insight: You should be able to draw the LoRA math on a whiteboard. W + BA. Rank r. Why r << d. That's table stakes for a Senior AI Engineer fine-tuning discussion.
Instead of updating the full weight matrix W (d×k), LoRA decomposes the update as ΔW = BA, where B is d×r and A is r×k, with r << d. At forward pass: output = (W + BA)x = Wx + B(Ax). You freeze W and only train A and B. For a 4096×4096 layer, a full update touches ~16.8M parameters; LoRA with r=8 trains only d·r + r·k ≈ 65K. Huge savings.
Rank selection: r=4 to r=64 typical. r=8 or r=16 for most tasks. Higher rank = more capacity, more parameters, more overfitting risk. Start low; increase if underfitting.
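The parameter arithmetic above is easy to sanity-check with a few lines of plain Python (the helper name is ours, nothing framework-specific):

```python
def lora_param_counts(d: int, k: int, r: int) -> tuple[int, int]:
    """Return (full-update params, LoRA params) for a d x k weight matrix."""
    full = d * k          # every entry of Delta W
    lora = d * r + r * k  # B (d x r) plus A (r x k)
    return full, lora

full, lora = lora_param_counts(4096, 4096, r=8)
print(f"{full:,} vs {lora:,}  ({full // lora}x fewer)")
# 16,777,216 vs 65,536  (256x fewer)
```

Doubling the rank doubles the LoRA count but leaves the ratio enormous, which is why rank sweeps are cheap to run.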
Target modules: Usually attention layers—q_proj, v_proj, k_proj, o_proj. Some configs add mlp (feed-forward). Targeting only q_proj and v_proj is a common minimal setup. All linear layers = max capacity, max params.
Merge after training: LoRA adapters add inference overhead (extra matmuls). You can merge the adapters into the base weights: W_new = W + BA. Then inference is identical to the base model—zero overhead. Do this before deployment.
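The merge identity can be verified numerically. A small NumPy sketch with toy dimensions and random matrices, showing the adapter-path forward and the merged forward agree:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 64, 8
W = rng.normal(size=(d, k))           # frozen base weight
B = rng.normal(size=(d, r)) * 0.01    # trained LoRA factor (d x r)
A = rng.normal(size=(r, k)) * 0.01    # trained LoRA factor (r x k)
x = rng.normal(size=(k,))

adapter_out = W @ x + B @ (A @ x)     # training-time path: extra matmuls
W_merged = W + B @ A                  # one-time merge after training
merged_out = W_merged @ x             # deployment path: single matmul

assert np.allclose(adapter_out, merged_out)  # identical outputs, zero overhead
```

The merge is also reversible in principle (keep B and A around to subtract BA), which is why serving multiple adapters on one base model is practical.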
Why This Matters in Production: LoRA lets you train a 7B model on a single A100 or even an RTX 4090. Merge adapters for production to avoid latency bloat.
Aha Moment: LoRA works because of the low-rank hypothesis: weight updates during adaptation lie in a low-dimensional subspace. We're not losing much by compressing them.
4. QLoRA
Interview Insight: QLoRA is how you fine-tune on consumer hardware. 7B model on 24GB? QLoRA. Know the ingredients: 4-bit quant base, fp16 adapters, double quantization, paged optimizers.
QLoRA = Quantized LoRA. Quantize the base model to 4-bit (NF4—NormalFloat4, designed for neural network weights). Apply LoRA adapters in fp16/bf16. The base model stays tiny in memory; only the adapters need full precision. Result: train 7B on a single 24GB GPU (e.g., RTX 4090, A10, or consumer cards).
Double quantization: Quantize the quantization constants (scaling factors) to save another few hundred MB. Usually worth it.
Paged optimizers: Offload optimizer states to CPU when GPU memory spikes (e.g., during gradient accumulation). Prevents OOM. Slows training slightly but unlocks training on memory-constrained machines.
Caveats: 4-bit base = some quality loss. For most adaptation tasks, it's negligible. If you need every last bit of fidelity, use regular LoRA with FP16 base (more VRAM).
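The memory claim is back-of-envelope arithmetic. A sketch that counts weight bytes only (real training also needs activations, adapter weights, optimizer states, and CUDA overhead):

```python
def weight_gb(params_b: float, bits: int) -> float:
    """Approximate weight memory in GB for params_b billion parameters."""
    return params_b * 1e9 * bits / 8 / 1024**3

print(round(weight_gb(7, 16), 1))  # 13.0 -> 7B at fp16 barely fits a 24GB card
print(round(weight_gb(7, 4), 1))   # 3.3  -> 4-bit base leaves room to train
```

That ~10GB difference is what lets QLoRA fit the base model, adapters, and optimizer states on a single consumer GPU.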
Why This Matters in Production: This is how startups and enterprises without A100 clusters fine-tune. At Maersk, we could prototype extraction models on a single GPU before scaling to training clusters.
Aha Moment: QLoRA democratized fine-tuning. You don't need a cloud GPU budget to experiment.
5. Hugging Face Ecosystem
Interview Insight: HF is the de facto standard. Transformers, PEFT, TRL, Datasets, Accelerate—you should know what each does and how they fit together.
| Library | Role |
|---|---|
| Transformers | Model loading, tokenizers, pipelines |
| PEFT | LoRA config, get_peft_model, adapter management |
| TRL | SFTTrainer, DPOTrainer, RLHF pipelines |
| Datasets | Load, format, split, streaming |
| Accelerate | Multi-GPU, mixed precision, gradient accumulation |
Typical flow: Load model with transformers.AutoModelForCausalLM, wrap with peft.get_peft_model and LoraConfig, train with trl.SFTTrainer, evaluate, merge with model.merge_and_unload(), save.
```mermaid
flowchart LR
    subgraph Load[Load & Configure]
        base[Base Model]
        lora[LoraConfig]
        peft[get_peft_model]
        base --> lora --> peft
    end
    subgraph Train[Train]
        data[Dataset]
        sft[SFTTrainer]
        peft --> sft
        data --> sft
    end
    subgraph Deploy[Deploy]
        merge[merge_and_unload]
        save[Save merged model]
        sft --> merge --> save
    end
```
Why This Matters in Production: This stack is what you'll use in 95% of fine-tuning projects. Alternatives (Axolotl, LLaMA-Factory) build on the same primitives.
Aha Moment: SFTTrainer handles chat formatting, packing, and logging. Don't roll your own training loop unless you have a reason.
6. Data Requirements
Interview Insight: "How much data do you need?"—expect this. The answer is always "it depends," but you need concrete ranges and the importance of quality.
How much data:
- LoRA: Hundreds to low thousands of high-quality examples. 500–2000 is typical for narrow tasks. 5000+ for broader behavior.
- Full fine-tuning: Tens of thousands. 50K+ for meaningful full adaptation.
- QLoRA: Same as LoRA; quantization doesn't change data needs.
Data quality > quantity. 500 clean, diverse, correctly formatted examples beat 10,000 messy ones. Garbage in, garbage out. Deduplicate. Remove near-duplicates (embed and cluster; keep one per cluster). Check for label noise, format drift, and bias.
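The "embed and cluster" dedup pass can be as simple as a greedy cosine threshold over embeddings. A self-contained sketch with toy 2-d vectors standing in for real embeddings — in practice you'd embed with a sentence encoder, and the 0.95 threshold is an assumption to tune:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def dedupe(examples, embeddings, threshold=0.95):
    """Keep an example only if it isn't a near-duplicate of one already kept."""
    kept, kept_vecs = [], []
    for ex, vec in zip(examples, embeddings):
        if all(cosine(vec, kv) < threshold for kv in kept_vecs):
            kept.append(ex)
            kept_vecs.append(vec)
    return kept

examples = ["book 2 TEU to Rotterdam", "book two TEU to Rotterdam", "invoice query"]
embeddings = [[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]]  # toy 2-d "embeddings"
print(dedupe(examples, embeddings))  # drops the near-duplicate second example
```

Greedy filtering is O(n²) in the worst case; at scale you'd switch to approximate nearest-neighbor search, but the idea is the same.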
Formats:
- Instruction/response pairs: `{"instruction": "...", "response": "..."}` — most common for SFT.
- Chat format: `[{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]` — for multi-turn or chat models.
- Completion format: Raw text continuations for causal LM training (less common for instruction models).
Why This Matters in Production: At Maersk, extraction fine-tuning would use (email snippet, structured JSON) pairs. We'd need hundreds of real or synthetic examples with correct schema. Quality mattered more than volume.
Aha Moment: Spend 80% of your time on data. The training loop is a commodity.
7. Training Pipeline Design
Interview Insight: Interviewers want to see you think in pipelines, not one-off scripts. Data → format → split → config → train → eval → merge → deploy.
Pipeline stages:
- Data prep — Load, clean, deduplicate, validate format.
- Format conversion — Convert to chat or instruction format expected by the tokenizer (e.g., ChatML, Llama chat template).
- Train/val split — 90/10 or 95/5. Stratify if you have categories.
- Configure LoRA — Rank, alpha, target modules, dropout.
- Train with SFTTrainer — Learning rate, epochs, batch size, gradient accumulation.
- Evaluate — On held-out set and (if possible) general benchmarks to check forgetting.
- Merge adapters — `merge_and_unload()` for zero-overhead inference.
- Deploy — Export to vLLM, Triton, or an API server.
Key hyperparameters:
| Param | Typical range | Notes |
|---|---|---|
| Learning rate | 2e-5 to 1e-4 | LoRA: often 2e-5. Higher = faster, risk of instability. |
| Epochs | 1–5 | 2–3 common. Watch for overfitting. |
| Batch size | 2–8 per device | Use gradient accumulation for effective batch. |
| Gradient accumulation | 4–16 | Effective batch = per_device × accumulation × devices. |
| Warmup | 3–10% | Stabilizes early training. |
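The effective-batch formula from the table, as a one-line check:

```python
def effective_batch(per_device: int, grad_accum: int, devices: int = 1) -> int:
    """Effective batch = per_device x accumulation x devices."""
    return per_device * grad_accum * devices

print(effective_batch(2, 8))     # 16 on a single GPU
print(effective_batch(4, 8, 4))  # 128 across a 4-GPU node
```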
Why This Matters in Production: A reproducible pipeline beats ad-hoc experimentation. Version your data, config, and checkpoints. Use MLflow or similar to track runs.
Aha Moment: Early stopping on validation loss prevents overfitting. Monitor both task accuracy and general capability (e.g., MMLU sample) to catch forgetting.
8. Catastrophic Forgetting
Interview Insight: "What is catastrophic forgetting?" — Standard question. You need the definition, why it happens, and mitigations.
The model loses general capabilities after fine-tuning on narrow data. It was great at reasoning, coding, and general knowledge; after fine-tuning on shipping confirmations, it forgets basic math, or it follows your format but can't generalize. Like training a doctor to specialize in cardiology and having them forget how to take blood pressure.
Why: Gradient updates optimize for your task. General knowledge gets overwritten.
Mitigations:
- LoRA (or PEFT) — Base weights stay frozen. Only adapters change. Major protection.
- Mix general data — Add 5–20% general instruction data (Alpaca, ShareGPT) to the mix. Anchors general capabilities.
- Early stopping — Stop before overfitting. More epochs = more forgetting.
- Evaluation — Run general benchmarks (MMLU, HumanEval sample) alongside task eval. If general drops, you've forgotten.
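The data-mixing mitigation is a few lines over two example lists. A sketch assuming in-memory lists of dicts (with Hugging Face `datasets` you'd reach for `interleave_datasets` with `probabilities` instead):

```python
import random

def mix_datasets(task_data, general_data, general_frac=0.1, seed=42):
    """Blend general instruction data into the task set so it makes up
    roughly general_frac of the final mix (anchors general capabilities)."""
    rng = random.Random(seed)
    n_general = round(len(task_data) * general_frac / (1 - general_frac))
    sampled = rng.sample(general_data, min(n_general, len(general_data)))
    mixed = task_data + sampled
    rng.shuffle(mixed)
    return mixed

task = [{"text": f"task-{i}"} for i in range(900)]
general = [{"text": f"general-{i}"} for i in range(5000)]
mixed = mix_datasets(task, general, general_frac=0.1)
print(len(mixed))  # 1000 -> 100 general examples, 10% of the mix
```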
Why This Matters in Production: Fine-tuning for extraction shouldn't kill the model's ability to handle edge cases or reason. Monitor both task and general metrics.
Aha Moment: LoRA is your first line of defense. It's not just about efficiency—it's about preserving the base model.
9. SFT vs RLHF vs DPO
Interview Insight: SFT is what you'll do 90% of the time. RLHF and DPO are for alignment and preference learning. Know the difference and when each applies.
| Method | What it does | When to use |
|---|---|---|
| SFT | Supervised Fine-Tuning. Train on (input, desired output) pairs. Standard instruction tuning. | Default. Most adaptation tasks. |
| RLHF | Reinforcement Learning from Human Feedback. Train a reward model on preferences, then optimize policy (model) with PPO. | Alignment. Used by OpenAI, Anthropic for Chat models. Expensive, complex. |
| DPO | Direct Preference Optimization. Uses preference pairs (chosen vs rejected) directly—no separate reward model. | Simpler alternative to RLHF. When you have preference data and want better alignment. |
SFT — You have correct answers. Train the model to produce them. Extraction, classification, format adherence, domain terminology. This is the bread and butter.
RLHF — You have preferences (A is better than B) but not necessarily "correct" answers. Train a reward model to score outputs, then use PPO to maximize reward. Powerful but needs lots of preference data, reward model training, and PPO tuning. Rarely worth it for enterprise unless you're building a ChatGPT competitor.
DPO — You have (prompt, good_response, bad_response) triples. DPO trains the policy directly to prefer good over bad. No reward model. Simpler than RLHF, often similar results. Use when you have preference data (e.g., human ratings of model outputs) and want to improve alignment.
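The DPO objective is compact enough to write down: per preference pair, loss = −log σ(β[(log πθ(y_w|x) − log πref(y_w|x)) − (log πθ(y_l|x) − log πref(y_l|x))]). A scalar sketch with made-up sequence log-probs (β=0.1, a common default):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss from sequence log-probs (policy pi, frozen reference)."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return math.log1p(math.exp(-beta * margin))  # = -log(sigmoid(beta * margin))

# Policy prefers the chosen answer more than the reference does -> lower loss
print(round(dpo_loss(-10.0, -14.0, -11.0, -13.0), 4))  # 0.5981
# Policy prefers the rejected answer -> higher loss, gradient pushes back
print(round(dpo_loss(-14.0, -10.0, -13.0, -11.0), 4))  # 0.7981
```

Note the reference model only appears as a subtraction inside the margin — that's the whole "no separate reward model" trick.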
Why This Matters in Production: For Maersk-style extraction and booking, SFT is sufficient. DPO could help if we had human feedback on "this extraction is better than that one" and wanted to optimize for that. RLHF is overkill.
Aha Moment: Start with SFT. Add DPO only if you have preference data and need alignment beyond what SFT gives.
10. Cost and Infrastructure
Interview Insight: "How much does fine-tuning cost?" — Have numbers. LoRA on 7B: single A100 or RTX 4090. Full FT: multi-GPU. Cloud options matter.
GPU sizing:
| Setup | Model size | Hardware |
|---|---|---|
| LoRA / QLoRA | 7B | Single A100 (40GB), RTX 4090 (24GB), A10 |
| LoRA / QLoRA | 13B | A100 40GB or 80GB |
| LoRA | 70B | 2–4x A100 80GB |
| Full FT | 7B | 2–4x A100 |
| Full FT | 70B | 8+ A100 80GB |
Cloud options: Lambda Labs, RunPod, vast.ai (cheap spot GPUs), AWS SageMaker, GCP Vertex AI. Lambda and RunPod often cheaper for short bursts. SageMaker for enterprise integration.
Training time (rough): LoRA 7B, 1K examples, 3 epochs—~30 min to 2 hours on A100. Full FT—hours to a day.
Cost comparison vs API: Fine-tuning has upfront cost (GPU hours) but then inference is cheap if self-hosted. API: pay per token forever. Rule of thumb: if you're doing millions of inferences on the same task, fine-tuning can pay off. If volume is low or task shifts often, API wins.
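The rule of thumb above is a one-line break-even calculation. All dollar figures in this sketch are illustrative assumptions, not real quotes:

```python
def breakeven_requests(train_cost_usd, api_cost_per_req, hosted_cost_per_req):
    """Requests needed before fine-tune + self-host beats a pay-per-token API."""
    saving = api_cost_per_req - hosted_cost_per_req
    if saving <= 0:
        return float("inf")  # API is cheaper per request; never breaks even
    return train_cost_usd / saving

# Assumptions: $50 of GPU time to train; $0.002/request via a hosted API;
# $0.0004/request amortized self-hosted inference.
print(round(breakeven_requests(50, 0.002, 0.0004)))  # 31250
```

At tens of thousands of requests the training cost amortizes quickly; at hundreds, it never does — which matches the "millions of inferences" heuristic.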
Why This Matters in Production: Budget before you start. A $50 RunPod A100 for a few hours is fine for experimentation. Production training might need reserved instances and reproducibility.
Aha Moment: QLoRA on a 24GB card = fine-tune 7B for under $20 in cloud compute. That changes the calculus for prototyping.
11. Practical Examples
Interview Insight: Be ready to give concrete examples. Extraction, classification, domain terms, format, code. These map to real interview questions.
| Use case | Why fine-tune | What you gain |
|---|---|---|
| Extraction accuracy | Schema and field accuracy matter. Prompts get 85%, you need 98%. | Consistent JSON structure, fewer field errors, better handling of edge formats. |
| Classification | Custom labels, nuanced categories. Off-the-shelf models don't know your taxonomy. | Higher precision/recall on your labels. |
| Domain terminology | Model keeps saying "container" when you need "TEU" or "FCL." | Correct jargon, fewer corrections. |
| Response format | Email template, disclaimer, specific structure. Prompts drift. | Format baked in. No prompt tokens for format. |
| Code generation on internal APIs | Model doesn't know your SDK, conventions, or internal endpoints. | Correct imports, API calls, patterns. |
For Maersk email booking: extraction (carrier, port, date, rate from emails), format (booking confirmation template), domain terms (TEU, FCL, B/L, etc.). Fine-tuning could target all three if prompt engineering hit a ceiling.
Why This Matters in Production: Each example has a clear metric. Extraction: field-level accuracy. Format: template adherence. Terms: term error rate. Measure before and after.
Aha Moment: Fine-tune for one thing per run when possible. Multi-objective fine-tuning is harder to tune and evaluate.
12. OpenAI / Anthropic Fine-Tuning APIs
Interview Insight: Provider fine-tuning vs self-hosted is a trade-off. Know when to use each.
OpenAI fine-tuning — GPT-4o, GPT-4o-mini, etc. Upload JSONL, they train. You get a fine-tuned model ID, use it like base model. Pay per token (usually slightly higher than base). No GPU management. Limited control over hyperparameters and architecture.
Anthropic Claude fine-tuning — Similar. Upload data, they train. Limited model availability (check current offerings). Same trade-off: convenience vs control.
When to use provider fine-tuning:
- You want zero infra. No GPU, no training pipeline.
- Your data can be sent to the provider (privacy/compliance OK).
- You're fine with their pricing and model versioning.
- You don't need custom architectures (e.g., specific LoRA config).
When to use self-hosted:
- Data must stay on-prem or in your cloud.
- You need full control: LoRA rank, target modules, custom data formats.
- Volume is high enough that inference cost savings from self-hosting matter.
- You're fine-tuning open-source (Llama, Mistral, etc.) that providers don't offer.
Why This Matters in Production: Maersk likely has data residency constraints—sending customer emails to OpenAI for fine-tuning may be a no-go. Self-hosted with QLoRA on internal infra is the default for sensitive data.
Aha Moment: Provider fine-tuning is a fast path. Self-hosted is the flexible path. Pick based on data, compliance, and control needs.
Mermaid Diagrams
Fine-Tuning Decision Flow
```mermaid
flowchart TD
    need[Need Customization]
    pe[Try Prompt Engineering]
    peOk{Good Enough?}
    rag[Add RAG]
    ragOk{Good Enough?}
    ft[Fine-Tune]
    need --> pe
    pe --> peOk
    peOk -->|Yes| done[Ship It]
    peOk -->|No| rag
    rag --> ragOk
    ragOk -->|Yes| done
    ragOk -->|No| ft
    ft --> done
```
LoRA Weight Update
```mermaid
flowchart LR
    subgraph Full["Full Update (expensive)"]
        W[W d x k]
        deltaW["Delta W d x k"]
        Wnew["W' = W + Delta W"]
        W --> deltaW --> Wnew
    end
    subgraph LoRA["LoRA (efficient)"]
        W2[W frozen]
        B["B d x r"]
        A["A r x k"]
        BA["BA = Delta W"]
        W2 --> B
        B --> A --> BA
    end
```
Training Pipeline
```mermaid
flowchart TB
    data[Data Prep]
    format[Format Conversion]
    split[Train/Val Split]
    config[LoRA Config]
    train[SFTTrainer]
    eval[Evaluate]
    merge[Merge Adapters]
    deploy[Deploy]
    data --> format --> split --> config --> train --> eval --> merge --> deploy
```
SFT vs DPO vs RLHF
```mermaid
flowchart TB
    subgraph sft[SFT]
        sftIn["(input, output) pairs"]
        sftTrain[Train on correct answers]
        sftIn --> sftTrain
    end
    subgraph dpo[DPO]
        dpoIn["(prompt, chosen, rejected)"]
        dpoTrain[Direct preference optimization]
        dpoIn --> dpoTrain
    end
    subgraph rlhf[RLHF]
        rm[Train reward model]
        ppo[PPO policy optimization]
        rm --> ppo
    end
```
QLoRA Memory Layout
```mermaid
flowchart TB
    subgraph gpu[GPU Memory]
        base4["Base Model 4-bit"]
        adapters["LoRA Adapters fp16"]
        opt["Optimizer States"]
    end
    subgraph cpu[CPU optional]
        paged["Paged Optimizer"]
    end
    base4 --> adapters --> opt
    opt -.->|"if OOM"| paged
```
Code Examples
LoRA Config with PEFT
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

model_name = "meta-llama/Llama-3.2-3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 4.2M || all params: 2.9B || trainable%: 0.14
```
SFTTrainer Setup
```python
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

dataset = load_dataset("json", data_files="train.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="./finetuned",
        num_train_epochs=3,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=2e-5,
        warmup_ratio=0.03,
        logging_steps=10,
        save_strategy="epoch",
        fp16=True,
        dataset_text_field="text",
        max_seq_length=2048,
        dataset_num_proc=4,
        packing=True,
    ),
)
trainer.train()
trainer.save_model("./finetuned")
```
Chat Format Data
```python
def format_chat(example):
    messages = [
        {"role": "user", "content": example["instruction"]},
        {"role": "assistant", "content": example["response"]},
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=False,
    )
    return {"text": text}

dataset = dataset.map(format_chat, remove_columns=dataset.column_names)
# Use dataset_text_field="text" in SFTTrainer
```
Inference with Merged Model
```python
from peft import PeftModel  # PeftModel.from_pretrained(base, adapter_dir) loads saved adapters

# Option 1: Merge adapters for zero-overhead inference
model = model.merge_and_unload()
model.save_pretrained("./merged_model")
tokenizer.save_pretrained("./merged_model")

# Option 2: Load merged model and run inference
from transformers import pipeline

pipe = pipeline("text-generation", model="./merged_model", tokenizer=tokenizer)
output = pipe(
    "Extract carrier, port, and date from: Ship via Maersk from LAX to Rotterdam, ETA 2024-03-20",
    max_new_tokens=128,
    do_sample=False,  # greedy decoding; generate() rejects temperature=0
)
print(output[0]["generated_text"])
```
QLoRA with BitsAndBytes
```python
from transformers import BitsAndBytesConfig

# model_name and lora_config as defined in the LoRA config example above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
model = get_peft_model(model, lora_config)
# Now train with SFTTrainer - fits on 24GB GPU
```
Conversational Interview Q&A
1. "When would you fine-tune instead of using RAG or prompt engineering?"
Weak answer: "When we need better accuracy or custom behavior." (Too vague. Doesn't give a decision framework.)
Strong answer: "I'd exhaust prompt engineering and RAG first. Fine-tune when: (1) we need consistent format or tone that prompts can't achieve—e.g., every shipping confirmation must start with a specific disclaimer. (2) extraction accuracy hits a ceiling—we're at 85% with RAG+prompts and need 98%. (3) latency-sensitive deployment needing a smaller model that behaves like our task—fine-tune 7B for extraction so we don't need the 70B API. At Maersk, our email booking agent used RAG for templates. We'd only consider fine-tuning if we hit a ceiling on port code or carrier extraction accuracy that we couldn't fix with better prompts or retrieval."
2. "Explain LoRA. Why does it work?"
Weak answer: "LoRA trains small adapter layers instead of the full model. It's parameter-efficient." (Missing the math and the intuition.)
Strong answer: "LoRA decomposes the weight update as ΔW = BA, where B is d×r and A is r×k, r << d. Instead of updating ~16.8M parameters in a 4096×4096 layer, we train ~65K for r=8. The low-rank hypothesis: adaptations during fine-tuning lie in a low-dimensional subspace. We're not losing much. We freeze the base W and only train A and B. After training, we merge: W' = W + BA. Inference is identical to base model—zero overhead. It works because most task-specific updates are compressible."
3. "What's the difference between QLoRA and LoRA?"
Weak answer: "QLoRA uses quantization." (Doesn't explain what or why.)
Strong answer: "QLoRA quantizes the base model to 4-bit (NF4) while keeping LoRA adapters in fp16. The base model uses a fraction of the memory—a 7B model goes from ~14GB to ~4GB. That lets you train on a 24GB consumer GPU. Double quantization squeezes the scaling factors too. Paged optimizers offload optimizer states to CPU on memory spikes. Trade-off: 4-bit base has slight quality loss, but for most adaptation tasks it's negligible. QLoRA democratized fine-tuning—you don't need A100s."
4. "How much data do you need for LoRA?"
Weak answer: "A few thousand examples." (No nuance on quality vs quantity.)
Strong answer: "Hundreds to low thousands for narrow tasks—500 to 2000 high-quality examples is typical. Quality trumps quantity. 500 clean, diverse, correctly formatted examples beat 10K messy ones. For broader behavior or style, 5K+. Deduplicate, remove near-duplicates, check for label noise. For our Maersk extraction use case, we'd need hundreds of (email, correct JSON) pairs. I'd spend most of the effort on data curation—the training loop is straightforward."
5. "What is catastrophic forgetting and how do you mitigate it?"
Weak answer: "The model forgets things. We use LoRA to reduce it." (Incomplete. No other mitigations.)
Strong answer: "Catastrophic forgetting is when fine-tuning on narrow data erodes general capabilities—the model was good at reasoning, now it can't. LoRA is the main mitigation: we freeze the base, only train adapters. Beyond that: mix in 5–20% general instruction data to anchor capabilities, use early stopping to avoid overfitting, and evaluate on general benchmarks (MMLU sample) alongside task metrics. If general drops, we've forgotten. At Maersk, we'd monitor both extraction accuracy and a small set of general questions."
6. "When would you use provider fine-tuning (OpenAI/Anthropic) vs self-hosted?"
Weak answer: "Depends on cost and control." (Missing data residency, compliance, volume.)
Strong answer: "Provider fine-tuning: zero infra, fast iteration, data can leave your environment. Good when you're prototyping or when compliance allows. Self-hosted: when data must stay on-prem or in your cloud—Maersk has data residency requirements, so customer emails likely can't go to OpenAI. Also when you need full control—custom LoRA config, open-source models, specific data formats. And when inference volume is high enough that self-hosting pays off. Rule of thumb: sensitive data or high volume → self-host. Quick experiment, no compliance concerns → provider."
7. "Walk me through a fine-tuning pipeline from data to deployment."
Weak answer: "We prepare data, train with SFTTrainer, and deploy." (No detail.)
Strong answer: "Data prep: load, clean, deduplicate. Format: convert to chat or instruction format the tokenizer expects. Split 90/10. Configure LoRA—r=16, target q_proj/v_proj/k_proj/o_proj. SFTTrainer with 2e-5 LR, 3 epochs, batch size 2, grad accum 8. Evaluate on val set and a sample of general benchmarks. Merge adapters with merge_and_unload for zero-overhead inference. Save. Deploy to vLLM or our API layer. Version data, config, and checkpoint in MLflow. At Maersk we'd add evals for extraction field accuracy before and after to measure lift."
From Your Experience (Maersk Prompts)
When they ask "tell me about a time you..."—you've got these. Tailor to your actual experience.
"Have you fine-tuned models at Maersk? When would you have considered it for the email booking agent?"
Most of our customization is prompt engineering and RAG. We'd consider fine-tuning if extraction accuracy hit a ceiling—e.g., port code confusion (LAX vs Lax) or carrier name variants that prompts couldn't fix. Or if we needed a smaller, faster model for edge deployment with the same extraction behavior. The decision would come after exhausting better prompts, more few-shot examples, and improved retrieval. Fine-tuning is the last resort.
"How would you design a fine-tuning pipeline for the Enterprise AI Agent Platform?"
We'd need a clear use case first—e.g., extraction, classification, or format adherence for a specific agent. Data: curate instruction/response or chat-format pairs, quality over quantity. Use PEFT (LoRA or QLoRA) to preserve base capabilities. Train with SFTTrainer, evaluate on task metrics plus a sample of general benchmarks. Integrate with MLflow for experiment tracking. Merge adapters and deploy via our model serving layer. Guardrails and evaluations would run on fine-tuned models the same as base models. We'd only do this for workloads where prompt engineering had hit a limit.
"What infrastructure would you use for fine-tuning at Maersk—cloud or on-prem?"
Given data residency, on-prem or approved cloud (e.g., our own Azure/AWS) is likely required. We'd use QLoRA for 7B models on 24GB cards—single A10 or A100 per run. For larger models or full fine-tuning, we'd need a multi-GPU cluster. Training would be episodic—not continuous—so spot or reserved instances depending on urgency. We'd track cost per run and compare to API costs to justify when fine-tuning pays off.
Quick Fire Round
- When does fine-tuning beat RAG? When you need consistent behavior (tone, format, terms) or extraction accuracy that prompts can't achieve. Not for knowledge—that's RAG.
- What is PEFT? Parameter-Efficient Fine-Tuning. Freeze most weights, train small adapters. LoRA, Prefix Tuning, Prompt Tuning.
- LoRA formula? ΔW = BA. B d×r, A r×k. r << d. Train A and B only.
- Typical LoRA rank? r=4 to r=64. r=8 or r=16 common. Start low.
- What modules does LoRA typically target? q_proj, v_proj, k_proj, o_proj (attention). Sometimes mlp.
- Merge adapters for what? Zero inference overhead. W' = W + BA, then inference = base model.
- QLoRA in one sentence? 4-bit quantize base, LoRA in fp16. Train 7B on 24GB.
- Double quantization? Quantize the scaling factors in 4-bit quant. Saves more memory.
- Paged optimizers? Offload optimizer states to CPU when GPU OOM. For memory-constrained training.
- SFTTrainer from? TRL (Transformer Reinforcement Learning library). Handles SFT, DPO.
- Data for LoRA? Hundreds to low thousands. Quality > quantity.
- SFT vs DPO vs RLHF? SFT = (input, output) pairs. DPO = (prompt, chosen, rejected). RLHF = reward model + PPO.
- Catastrophic forgetting mitigation? LoRA (freeze base), mix general data, early stopping, eval on general benchmarks.
- 7B LoRA on what GPU? Single A100 40GB or RTX 4090 24GB (with QLoRA).
- Provider vs self-host fine-tuning? Provider = convenience, data leaves. Self-host = control, data residency, open-source.
Key Takeaways (Cheat Sheet)
| Topic | Key Point |
|---|---|
| When to fine-tune | Exhaust prompts + RAG first. Fine-tune for consistent behavior, extraction accuracy, or smaller latency-sensitive models. |
| When NOT to | RAG handles knowledge. Prompts handle most behavior. <100 examples. Task changes often. |
| Full vs PEFT | Full = all weights, expensive, forgetting risk. PEFT = adapters only. LoRA is default. |
| LoRA | ΔW = BA. r=8–16. Target attention. Merge after training. |
| QLoRA | 4-bit base + fp16 LoRA. 7B on 24GB. Double quant, paged optimizers. |
| HF stack | Transformers, PEFT, TRL, Datasets, Accelerate. SFTTrainer for SFT. |
| Data | 500–2K for LoRA. Quality > quantity. Chat or instruction format. Dedupe. |
| Pipeline | Data prep → format → split → LoRA config → SFTTrainer → eval → merge → deploy. |
| Hyperparams | LR 2e-5, epochs 2–3, batch 2–8, grad accum 4–16. |
| Catastrophic forgetting | LoRA mitigates. Mix general data. Early stop. Eval general benchmarks. |
| SFT vs DPO vs RLHF | SFT = default. DPO = preferences, no reward model. RLHF = alignment, expensive. |
| Infrastructure | LoRA 7B: single A100 or 4090. QLoRA: 24GB. Cloud: Lambda, RunPod, SageMaker. |
| Provider fine-tuning | OpenAI/Anthropic. Convenience vs control. Data residency often blocks. |
Further Reading
- LoRA paper (Hu et al., 2021) — Low-Rank Adaptation of Large Language Models. The foundational paper.
- QLoRA paper (Dettmers et al., 2023) — Efficient Finetuning of Quantized LLMs. 4-bit + LoRA.
- PEFT library — Hugging Face PEFT. LoRAConfig, get_peft_model.
- TRL — Transformer Reinforcement Learning. SFTTrainer, DPOTrainer.
- Axolotl — Alternative training framework. YAML config, popular for Llama fine-tuning.
- OpenAI Fine-Tuning — API docs. When to use provider vs self-host.
- Anthropic Fine-Tuning — Claude fine-tuning offerings. Check current availability.
- DPO paper — Direct Preference Optimization. Simpler than RLHF for preference learning.