Fine-Tuning — Interview Questions
Q1: When would you choose fine-tuning over RAG? Walk through your decision process.
Answer:
Decision framework: ask these questions in order.
1. Is the problem about KNOWLEDGE or BEHAVIOR?
   Knowledge (facts, documents, latest info) → RAG
   Behavior (tone, format, style, domain vocabulary) → Fine-tuning
2. How often does the data change?
   Frequently (daily/weekly) → RAG (re-embed, no retraining)
   Rarely (stable taxonomy, brand voice) → Fine-tuning
3. How many examples do you have?
   < 100 → Few-shot prompting
   100-1000 → Fine-tuning becomes viable
   1000+ → Fine-tuning shines
4. What's your latency budget?
   RAG adds retrieval overhead (often ~100-500ms for query embedding plus vector search)
   Fine-tuned models skip retrieval → lower latency, shorter prompts
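The questions above can be sketched as a small routing function. This is only an illustration: the name `choose_approach`, its parameters, and its return labels are hypothetical, and the 100-example threshold is the rough figure from the list, not a hard rule. Latency (question 4) is left out because it is a trade-off to weigh rather than a cutoff.

```python
def choose_approach(needs_knowledge: bool, needs_behavior: bool,
                    data_changes_often: bool, num_examples: int) -> str:
    """Hypothetical helper mirroring the decision questions above."""
    # Question 1: knowledge problems point to RAG, behavior problems to fine-tuning.
    if needs_knowledge and needs_behavior:
        base = "both"
    elif needs_knowledge:
        base = "rag"
    else:
        base = "fine-tune"

    # Question 2: frequently changing data favors RAG (re-embed instead of retrain).
    if base == "fine-tune" and data_changes_often:
        base = "rag"

    # Question 3: below ~100 examples, fall back to few-shot prompting for behavior.
    if num_examples < 100:
        if base == "fine-tune":
            return "few-shot prompting"
        if base == "both":
            return "rag + few-shot prompting"
    return base


print(choose_approach(True, False, True, 0))      # rag
print(choose_approach(False, True, False, 1500))  # fine-tune
print(choose_approach(True, True, False, 400))    # both
```

In practice the inputs are judgment calls, so a function like this is more useful as a checklist than as an automated gate.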
Concrete examples:
Use RAG:       "What does our refund policy say?"
               "Summarize this uploaded contract"
               "What changed in v3.2 of our API?"
Use fine-tune: "Always respond in our brand voice"
               "Generate SQL that follows our internal style"
               "Classify support tickets with our custom taxonomy"
               "Use our medical abbreviations correctly"
Use both:      Fine-tune for behavior/style, RAG for knowledge
               = consistent tone + up-to-date facts

Q2: You have 200 customer support examples. How do you build and validate a fine-tuned model?
Answer:
# Step 1: Data preparation and validation
import json

from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment


def validate_and_prepare(raw_examples: list[dict]) -> list[dict]:
    """Drop low-quality examples and convert the rest to chat format."""
    valid = []
    for ex in raw_examples:
        response = ex.get("response", "")
        # Quality checks
        if len(response) < 20:
            continue  # Too short
        if len(response) > 2000:
            continue  # Probably too long for a single turn
        if not ex.get("query"):
            continue  # Missing input
        valid.append({
            "messages": [
                {"role": "system", "content": "You are a helpful support agent for Acme Corp."},
                {"role": "user", "content": ex["query"]},
                {"role": "assistant", "content": response},
            ]
        })
    return valid
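# Step 2: Split train/eval (80/20)
import random

# raw_examples is assumed to be your 200 loaded records with "query" and "response" keys
examples = validate_and_prepare(raw_examples)
random.seed(42)  # Fixed seed so the split is reproducible
random.shuffle(examples)
split = int(len(examples) * 0.8)
train_data = examples[:split]  # 160 examples if all 200 pass validation
eval_data = examples[split:]   # 40 examples (held out)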
# Step 3: Augment to increase volume (optional but helpful with small datasets)
# Assumption: gpt4 is a LangChain chat model; any client with an async call would work.
from langchain_openai import ChatOpenAI

gpt4 = ChatOpenAI(model="gpt-4")


async def augment_example(ex: dict) -> list[dict]:
    """Generate 3 variations using GPT-4."""
    prompt = f"""Generate 3 varied versions of this support interaction.
Keep the same topic, quality, and tone. Vary the wording.
Original:
User: {ex['messages'][1]['content']}
Assistant: {ex['messages'][2]['content']}
Return JSON array of 3 objects with keys "user" and "assistant"."""
    result = await gpt4.ainvoke(prompt)
    variations = json.loads(result.content)  # Raises if the model returns non-JSON
    return [
        {"messages": [ex["messages"][0],  # Reuse the original system prompt
                      {"role": "user", "content": v["user"]},
                      {"role": "assistant", "content": v["assistant"]}]}
        for v in variations
    ]

# Augment the training split only, so the held-out eval set stays clean.
# After augmentation: 160 originals + 160 * 3 variations = 640 training examples
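# Step 4: Upload and train
def write_jsonl(path: str, items: list[dict]) -> None:
    with open(path, "w") as f:
        for item in items:
            f.write(json.dumps(item) + "\n")

write_jsonl("train.jsonl", train_data)
write_jsonl("eval.jsonl", eval_data)

with open("train.jsonl", "rb") as f:
    train_file = client.files.create(file=f, purpose="fine-tune")
with open("eval.jsonl", "rb") as f:
    eval_file = client.files.create(file=f, purpose="fine-tune")

job = client.fine_tuning.jobs.create(
    training_file=train_file.id,
    validation_file=eval_file.id,  # Track validation loss during training
    model="gpt-4o-mini-2024-07-18",
    hyperparameters={"n_epochs": 3},
)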
# Step 5: Evaluate on held-out eval set
# judge_brand_voice(query, response) is assumed to be your own LLM-judge helper
# returning a 0-10 score; it is not defined in this snippet.
BASE_MODEL = "gpt-4o-mini-2024-07-18"  # Same snapshot the fine-tune started from


def evaluate_fine_tuned(ft_model: str, eval_data: list[dict]) -> dict:
    base_scores, ft_scores = [], []
    for ex in eval_data[:20]:  # Sample 20 to limit cost
        prompt = ex["messages"][:2]  # system + user only
        query = prompt[1]["content"]
        ft_response = client.chat.completions.create(
            model=ft_model, messages=prompt
        ).choices[0].message.content
        base_response = client.chat.completions.create(
            model=BASE_MODEL, messages=prompt
        ).choices[0].message.content
        # LLM judge: does it match brand voice?
        ft_scores.append(judge_brand_voice(query, ft_response))
        base_scores.append(judge_brand_voice(query, base_response))
    base_avg = sum(base_scores) / len(base_scores)
    ft_avg = sum(ft_scores) / len(ft_scores)
    return {
        "base_avg": base_avg,
        "ft_avg": ft_avg,
        # Relative improvement over the base model's average score
        "improvement": f"{(ft_avg - base_avg) / base_avg * 100:.1f}%",
    }
# Typical result: base 5.8/10 → fine-tuned 8.2/10 for brand voice

Q3: What is LoRA and why is it the dominant fine-tuning approach?
Answer:
Full fine-tuning problem:
  7B model = ~14GB in float16
  Updating all weights requires:
  - Multiple A100 GPUs ($$$)
  - 40+ GB VRAM (weights plus gradients and optimizer state)
  - Hours to days of training
LoRA insight: most of the useful "change" in weights can be approximated
by a low-rank matrix decomposition.
Instead of:
  W_new = W_original + ΔW
  where ΔW is the same size as W (e.g., 4096 × 4096 ≈ 16.8M params)
LoRA decomposes ΔW into two small matrices:
  ΔW = B × A
  where B is (4096 × r) and A is (r × 4096), with r = 8 or 16
  r=16: B = 4096×16 ≈ 65.5k params
        A = 16×4096 ≈ 65.5k params
  Total adapter: ~131k params vs ~16.8M params for the full ΔW = about 0.8% of the size
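The parameter arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the function name `lora_param_counts` is made up for illustration, and it counts a single weight matrix only (a real model applies adapters to several matrices per layer).

```python
def lora_param_counts(d_out: int, d_in: int, r: int) -> tuple[int, int, float]:
    """Return (full delta-W params, LoRA adapter params, adapter/full ratio)."""
    full = d_out * d_in              # dense update with the same shape as W
    adapter = d_out * r + r * d_in   # B is (d_out x r), A is (r x d_in)
    return full, adapter, adapter / full


full, adapter, ratio = lora_param_counts(4096, 4096, 16)
print(full, adapter, f"{ratio:.2%}")  # 16777216 131072 0.78%
```

Because the adapter size grows linearly in r while the dense update grows with the product of the two dimensions, the ratio stays small for any rank well below the layer width.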
Benefits:
  ✓ Trains in hours on a single GPU (even consumer RTX 3090)
  ✓ Base model frozen → much lower risk of catastrophic forgetting
  ✓ Multiple adapters → swap behavior without reloading base model
  ✓ Merge the adapter into the base weights before serving → no added inference latency
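The merge claim follows from linearity: (W + B × A)x = Wx + B(Ax). Here is a toy check in plain Python with tiny integer matrices. The values and helper names are arbitrary illustrations, and the usual LoRA scaling factor (alpha / r) is omitted for brevity.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y):
    """Element-wise sum of two same-shaped matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


W = [[1, 0, 2], [0, 1, 0], [3, 0, 1]]  # frozen base weight (3 x 3)
B = [[1], [0], [2]]                    # trained adapter factor (3 x r), r = 1
A = [[0, 1, 1]]                        # trained adapter factor (r x 3)
x = [[1], [2], [3]]                    # input column vector

# Unmerged: run the base path and the adapter path separately, then add.
unmerged = add(matmul(W, x), matmul(B, matmul(A, x)))

# Merged: fold the adapter into the weight once, then do a single matmul.
W_merged = add(W, matmul(B, A))
merged = matmul(W_merged, x)

print(unmerged == merged)  # True
```

Keeping the adapter unmerged costs an extra pair of small matmuls per layer but allows swapping adapters at runtime; merging removes that cost at the price of fixing one adapter into the weights.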
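QLoRA extends this:
  Load base model in 4-bit (NF4 quantization)
  Apply LoRA adapters in 16-bit precision
  7B model weights: ~14GB → ~3.5-4GB VRAM
  Fine-tuning a 7B model becomes feasible on a single consumer GPU (8GB is tight but workable with short sequences and small batches)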
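The memory figures above come from simple bytes-per-parameter arithmetic. A rough sketch follows; `weight_memory_gb` is a hypothetical helper that counts the weights only. Activations, optimizer state, the adapter itself, and quantization constants all add to the real footprint.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


print(weight_memory_gb(7e9, 16))  # 14.0 (float16)
print(weight_memory_gb(7e9, 4))   # 3.5  (4-bit)
```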
When to set r higher:
  r=4: Style, tone (very simple task)
  r=8: Standard recommendation for most tasks
  r=16: Domain adaptation, complex behavior change
  r=64: Major capability changes (rare, close to full fine-tune)

Q4: Your fine-tuned model passes your eval benchmarks but production users complain it "sounds wrong." What happened?
Answer:
This is an eval/production distribution mismatch: the model was scored on data that does not look like what real users send.
Root causes to investigate:
1. Eval set wasn't representative
- You evaluated on "nice" handpicked examples
- Production has messy, typo-filled, ambiguous queries
- Fix: Build eval set by sampling from real production traffic
2. Training data had implicit patterns the model picked up
- All examples from one support agent's style
- Examples from one product category only
- "Sounds wrong" on queries that fall outside the training distribution
- Fix: Audit training data distribution; ensure it covers the full query space
3. Catastrophic forgetting (less common with LoRA)
- Full fine-tuning can overwrite general language capabilities
- Model "over-fits" to training distribution
- Fix: Add regularization; reduce epochs; use LoRA instead of full FT
4. System prompt drift
- Training used a different system prompt than production
- The fine-tuned "personality" conflicts with the runtime system prompt
- Fix: Always fine-tune with the exact system prompt used in production
5. Hallucination shift
- Fine-tuning on confident, assertive responses
- Model learned to assert things confidently even when wrong
- Fix: Include "I don't know" examples in training data for out-of-scope queries
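Two of the checks above (system prompt drift and training coverage) are easy to automate. Below is a minimal sketch, assuming the training examples use the chat `messages` format and that you have some way to label query types. The function names and the sample data are hypothetical.

```python
from collections import Counter


def audit_system_prompts(examples: list[dict], production_prompt: str) -> dict:
    """Count how many training examples use the exact production system prompt."""
    prompts = Counter(
        msg["content"]
        for ex in examples
        for msg in ex["messages"]
        if msg["role"] == "system"
    )
    matching = prompts.get(production_prompt, 0)
    return {
        "total": len(examples),
        "distinct_system_prompts": len(prompts),
        "matching_production": matching,
        "mismatched": len(examples) - matching,
    }


def coverage_gaps(train_labels: list[str], production_labels: list[str]) -> list[str]:
    """Query types seen in production that never appear in the training data."""
    return sorted(set(production_labels) - set(train_labels))


# Made-up sample data for illustration only.
train = [
    {"messages": [{"role": "system", "content": "You are Acme support."},
                  {"role": "user", "content": "Where is my order?"},
                  {"role": "assistant", "content": "Let me check that for you."}]},
    {"messages": [{"role": "system", "content": "You are a helpful bot."},
                  {"role": "user", "content": "Cancel my plan"},
                  {"role": "assistant", "content": "I can help with that."}]},
]
report = audit_system_prompts(train, "You are Acme support.")
gaps = coverage_gaps(["billing", "shipping"], ["billing", "returns", "shipping"])
print(report["mismatched"], gaps)  # 1 ['returns']
```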
Diagnosis process:
1. Collect 20 "sounds wrong" examples from production
2. Label what specifically is wrong (tone? factual? format?)
3. Check if those query types appear in training data
4. If not → add more training examples of that type
5. If yes → the model didn't learn from them → improve example quality

Q5: Tricky: Someone says "just use more few-shot examples in the prompt instead of fine-tuning." When are they right and when are they wrong?
Answer:
They're RIGHT when:
- You have < 50 examples → not enough data to fine-tune anyway
- The task is straightforward (structured output, simple classification)
- You need to iterate quickly (prompt change = instant, FT = hours)
- Cost isn't a concern (context window cost is acceptable)
- The task changes frequently
Example: "Format responses as JSON with these 3 fields"
→ 3 few-shot examples in the prompt work fine
They're WRONG when:
- You have hundreds of high-quality examples
- The "style" or "behavior" is subtle and hard to show in 5-10 examples
- Context window is filling up with examples (expensive, slower)
- Consistency is critical — few-shot can drift across conversations
- Latency matters — 30 shot examples = 3000+ extra tokens = slower + $$$
- The behavior needs to generalize across many novel input types
The key insight:
Few-shot = showing the model AT INFERENCE TIME
Fine-tuning = teaching the model BEFORE inference
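To make the distinction concrete, here is a sketch of what "showing at inference time" means for the request payload. `build_few_shot_messages` is a hypothetical helper and the example texts are placeholders.

```python
def build_few_shot_messages(system_prompt: str,
                            examples: list[tuple[str, str]],
                            user_query: str) -> list[dict]:
    """Build a chat prompt that carries every example on every request."""
    messages = [{"role": "system", "content": system_prompt}]
    for query, response in examples:
        messages.append({"role": "user", "content": query})
        messages.append({"role": "assistant", "content": response})
    messages.append({"role": "user", "content": user_query})
    return messages


shots = [(f"example question {i}", f"example answer {i}") for i in range(10)]
few_shot = build_few_shot_messages("You are a support agent.", shots, "Where is my order?")
fine_tuned = build_few_shot_messages("You are a support agent.", [], "Where is my order?")
print(len(few_shot), len(fine_tuned))  # 22 2
```

With a fine-tuned model the examples live in the weights, so the request shrinks to the system prompt plus the user query.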
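Cost comparison (at scale, 100k queries/day):
Few-shot (10 examples, ~1500 tokens): ~150M extra input tokens/day, about $22/day
  at an assumed $0.15 per 1M input tokens on gpt-4o-mini (check current pricing)
Fine-tuned model: no example tokens in the prompt, but a one-time training cost
  and possibly a higher per-token rate
Fine-tuning pays off once the accumulated few-shot savings exceed those costs, often within a few weeks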
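The comparison is straightforward arithmetic once the price is fixed. Here is a sketch with the price as an explicit parameter. The $0.15 per 1M input tokens and the $300 training cost used below are assumptions for illustration, not quoted rates, and the function names are hypothetical.

```python
def extra_prompt_cost_per_day(queries_per_day: int,
                              extra_tokens_per_query: int,
                              usd_per_million_input_tokens: float) -> float:
    """Daily cost of the extra few-shot tokens sent with every request."""
    total_tokens = queries_per_day * extra_tokens_per_query
    return total_tokens / 1_000_000 * usd_per_million_input_tokens


def break_even_days(one_time_training_cost_usd: float, daily_savings_usd: float) -> float:
    """Days until the saved prompt tokens pay for the fine-tuning run."""
    return one_time_training_cost_usd / daily_savings_usd


daily = extra_prompt_cost_per_day(100_000, 1500, 0.15)  # assumed price
print(f"${daily:.2f}/day")                              # $22.50/day
print(f"{break_even_days(300.0, daily):.1f} days")      # assumed $300 training cost
```

If the fine-tuned model is billed at a higher per-token rate, subtract that premium from the daily savings before computing the break-even point.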
Reality in production:
Correct answer = "use prompting first, fine-tune when it's not enough"
Treat fine-tuning as an optimization step, not a first resort