5 min read
Fine-Tuning LLMs
Fine-Tuning vs RAG vs Prompting
PROMPTING RAG FINE-TUNING
────────────────────────────────────────────────────────────────
When to use Style/format Facts/knowledge Behavior/tone/format
Simple tasks Live data Consistent style
Private data Domain specialization
Update cost Zero Re-embed docs Retrain (hours+GPU)
Knowledge Static Always current Static (as of train date)
factual cutoff
Latency Adds tokens +Retrieval time Baseline (no extra tokens)
Cost/query Context size Context size Smaller context → cheaper
Consistency Variable Variable High (baked into weights)
Best for:
Prompting: "Answer in JSON", "Act as a pirate", "Be concise"
RAG: "What does our policy say?", "Latest product docs"
Fine-tuning: "Always respond like our brand", "Medical terminology",
"Code in our internal style", "Classify with our taxonomy"When NOT to Fine-Tune
Don't fine-tune when:
✗ You need up-to-date information → use RAG
✗ Your dataset is small (<100 examples) → use few-shot prompting
✗ The task is simple → better prompting is faster and cheaper
✗ You need to debug/update quickly → retraining is slow
✗ Budget is tight → fine-tuning infra costs $$$
Do fine-tune when:
✓ You need consistent output format (e.g., always valid JSON with your schema)
✓ You have 1000+ high-quality examples
✓ The model needs domain-specific terminology (medical, legal, internal jargon)
✓ You need to reduce prompt length (compress few-shot examples into weights)
✓ Latency matters and shorter prompts help
✓ You want to remove capabilities (safety fine-tuning, RLHF)Fine-Tuning Methods
Full Fine-Tuning
All model weights updated.
Requires: Multiple high-end GPUs (7B model = ~56GB VRAM)
Best for: Major capability changes, academic research
Cost: $$$$ — use cloud (Lambda Labs, vast.ai, RunPod)LoRA (Low-Rank Adaptation) — Most Common
python# LoRA: instead of updating all weights, add small trainable "adapter" matrices
# Original: Y = W·X (W = frozen 4096×4096 = 16M params)
# LoRA: Y = W·X + B·A·X (A = 4096×16, B = 16×4096 = only 131k params)
# Key insight: the "update" to model weights is low-rank (most useful changes
# can be represented in a small subspace)
from peft import get_peft_model, LoraConfig, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=16, # Rank of the adapter matrices (lower = fewer params)
lora_alpha=32, # Scaling factor (usually 2x r)
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"], # Which layers to adapt
lora_dropout=0.05,
bias="none",
)
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
# → trainable params: 8,388,608 (0.1% of total) ← only 0.1% need updating!
# After training: merge adapters back into base model for deployment
merged_model = peft_model.merge_and_unload()QLoRA — Fine-Tune on Consumer Hardware
python# QLoRA = Quantized LoRA
# Load model in 4-bit quantization (4x memory reduction)
# + apply LoRA adapters
from transformers import BitsAndBytesConfig
import torch
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True, # Extra 0.4 bit per param
bnb_4bit_quant_type="nf4", # NormalFloat4 — best for LLMs
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-8B",
quantization_config=quantization_config,
device_map="auto",
)
# 8B model normally needs ~16GB VRAM (float16)
# QLoRA: needs ~5GB VRAM → fine-tune on a single RTX 3090!OpenAI Fine-Tuning (Easiest Path)
pythonfrom openai import OpenAI
import json
client = OpenAI()
# 1. Prepare training data (JSONL format)
training_data = [
{
"messages": [
{"role": "system", "content": "You are a customer support agent for Acme Corp."},
{"role": "user", "content": "How do I reset my password?"},
{"role": "assistant", "content": "To reset your password: go to acme.com/login, click 'Forgot Password', enter your email, check your inbox for a reset link. The link expires in 1 hour."}
]
},
# ... 50+ more examples
]
with open("training.jsonl", "w") as f:
for item in training_data:
f.write(json.dumps(item) + "\n")
# 2. Upload training file
with open("training.jsonl", "rb") as f:
file = client.files.create(file=f, purpose="fine-tune")
# 3. Start fine-tuning job
job = client.fine_tuning.jobs.create(
training_file=file.id,
model="gpt-4o-mini-2024-07-18", # Base model
hyperparameters={"n_epochs": 3},
)
# 4. Monitor training
while job.status not in ("succeeded", "failed"):
time.sleep(60)
job = client.fine_tuning.jobs.retrieve(job.id)
print(f"Status: {job.status}")
# 5. Use your fine-tuned model
fine_tuned_model = job.fine_tuned_model # e.g., "ft:gpt-4o-mini:acme:customer-support:abc123"
response = client.chat.completions.create(
model=fine_tuned_model,
messages=[
{"role": "system", "content": "You are a customer support agent for Acme Corp."},
{"role": "user", "content": "My order hasn't arrived"},
]
)Training Data Requirements
python# Minimum viable dataset sizes:
# OpenAI fine-tuning: 50-100 examples (works but weak)
# Good fine-tuning: 500-1000 examples
# Strong fine-tuning: 1000-10000 examples
# Full capability: 10000+ examples
# Data quality > quantity
# 100 perfect examples >> 1000 noisy examples
# Data collection strategies:
# 1. Manual curation (best quality, expensive)
# 2. Augmentation with GPT-4 (generate variations of good examples)
# 3. Distillation (use GPT-4 outputs to train smaller model)
# Augmentation example:
async def augment_example(example: dict) -> list[dict]:
"""Generate 5 variations of a training example."""
prompt = f"""Given this training example, generate 5 similar but varied versions.
Keep the same assistant response style and accuracy.
Original:
User: {example['user']}
Assistant: {example['assistant']}
Output as JSON array of {{"user": ..., "assistant": ...}} objects."""
result = await llm.ainvoke(prompt)
variations = json.loads(result.content)
return variations
# Data validation
def validate_training_example(example: dict) -> bool:
"""Check training example quality."""
msgs = example.get("messages", [])
if len(msgs) < 2:
return False
last = msgs[-1]
if last["role"] != "assistant":
return False
if len(last["content"]) < 10:
return False # Too short
if len(last["content"]) > 4000:
return False # Probably too long for a single turn
return TrueEvaluating Fine-Tuned Models
python# 1. Automated evaluation
from openai import OpenAI
def evaluate_response(question: str, model_response: str, expected_style: str) -> float:
"""Use GPT-4 to judge if response matches expected style."""
judge_prompt = f"""Rate this response on a scale of 0-10.
Criteria: {expected_style}
Question: {question}
Response: {model_response}
Output only a number 0-10."""
score = float(client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": judge_prompt}]
).choices[0].message.content.strip())
return score
# 2. A/B comparison (base vs fine-tuned)
def ab_compare(questions: list[str]) -> dict:
base_scores = []
ft_scores = []
for q in questions:
base_resp = call_model("gpt-4o-mini", q)
ft_resp = call_model(fine_tuned_model, q)
base_scores.append(evaluate_response(q, base_resp, "matches our brand voice"))
ft_scores.append(evaluate_response(q, ft_resp, "matches our brand voice"))
return {
"base_avg": sum(base_scores) / len(base_scores),
"ft_avg": sum(ft_scores) / len(ft_scores),
"improvement": (sum(ft_scores) - sum(base_scores)) / len(base_scores),
}
# 3. Regression testing — fine-tuning shouldn't break general capabilities
REGRESSION_SUITE = [
{"q": "What is 2+2?", "expected_contains": "4"},
{"q": "Write a hello world in Python", "expected_contains": "print"},
# ... general capability tests
]Fine-Tuning Cookbook (Practical Scenarios)
Scenario 1: Brand voice consistency
Problem: LLM responds generically, not like your brand
Solution: 200 examples of (customer query, ideal brand-voice response)
Model: gpt-4o-mini (cheaper, consistent at style)
Expected improvement: tone consistency 40% → 85%
Scenario 2: Internal tool extraction
Problem: Need to extract structured data from support tickets
Solution: 500 examples of (ticket text, JSON output)
Model: gpt-4o-mini with JSON mode
Expected: extraction accuracy 75% → 95%
Scenario 3: Code style
Problem: AI generates Python that doesn't follow your internal patterns
Solution: 1000 examples of (task, ideal code following your style guide)
Model: gpt-4o-mini
Expected: style compliance 30% → 80%
Scenario 4: Medical terminology
Problem: General LLM doesn't know your domain-specific abbreviations
Solution: 2000 examples with your clinical terminology
Model: Start with full fine-tune of smaller Llama model[prev·next]