AI Fundamentals — Tricky Questions
Questions That Trip People Up
Q1: If temperature=0 makes output deterministic, why do you sometimes get different responses with temperature=0?
Answer:
Several reasons output can still vary at temperature=0:
- Floating-point non-determinism — GPU parallel computation order can vary between runs due to floating-point rounding differences. Different hardware, different results.
- Model versioning — OpenAI silently updates model versions. `gpt-4` today ≠ `gpt-4` next month.
- Streaming vs non-streaming — Different code paths can produce slightly different outputs.
- Seed parameter — Some APIs have a separate `seed` parameter for true reproducibility (OpenAI added this). Temperature alone isn't enough.
- Top-p interaction — If both temperature and top-p are set, they interact in non-obvious ways.
```python
# For reproducibility, use seed + temperature=0
response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    seed=42,  # added for reproducibility
    messages=[...],
)
```

Q2: "LLMs don't understand language — they just predict tokens." Do you agree?
Answer (nuanced):
This is a genuinely contested question. The honest answer is: we don't know.
Arguments for "just predicting tokens":
- LLMs are trained purely on next-token prediction
- They consistently fail on tasks that require true logical reasoning
- They can be fooled by surface-level patterns (adversarial inputs)
- No grounded perception of the world (no sensory input)
Arguments against "just predicting":
- Emergent capabilities (in-context learning, arithmetic, code) weren't programmed
- GPT-4 passes bar exams, medical licensing exams — is that "just" prediction?
- Representations in the model encode factual relationships about the world
- Self-consistency and chain-of-thought improve reasoning → something reasoning-like is happening
The practical engineering answer: Treat LLMs as very powerful pattern matchers with emergent reasoning capabilities. Design your systems with the assumption they can reason but will make mistakes — use verification, structured outputs, and retrieval to compensate.
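The "verification and structured outputs" point can be made concrete. Below is a minimal sketch of a structured-output guard; the JSON schema and field names (`answer`, `confidence`) are hypothetical, but the pattern — parse, validate, reject — is the general one:

```python
import json

def parse_and_verify(raw: str) -> dict:
    # Parse the model's output and reject anything that doesn't match the
    # expected shape, instead of trusting free-form output blindly.
    data = json.loads(raw)
    if set(data) != {"answer", "confidence"}:
        raise ValueError(f"unexpected keys: {set(data)}")
    if not isinstance(data["answer"], str):
        raise ValueError("answer must be a string")
    if not 0.0 <= float(data["confidence"]) <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    return data

result = parse_and_verify('{"answer": "Paris", "confidence": 0.93}')
```

On a validation failure you can retry the call or fall back to a safe default, rather than passing a malformed answer downstream.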
Q3: Why can't you just increase context window to solve RAG?
Answer:
Even with 1M token context, RAG is still better because:
Cost — GPT-4o charges per token. At ~$5 per 1M input tokens, a full 1M-token context costs ~$5 per query. RAG retrieves only the relevant 1-5% of knowledge.
"Lost in the middle" problem — LLMs recall beginning and end of context well, but performance degrades for information in the middle of long contexts. Retrieved chunks put relevant info at the top.
Latency — Processing 1M tokens takes several seconds. RAG retrieves <5 chunks in milliseconds.
Dynamic knowledge — You can update your vector store without changing the model. You can't update a stuffed context.
Citation and traceability — RAG shows you which documents were used. Stuffed context makes attribution impossible.
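The retrieval step itself is cheap and simple. A toy sketch of the RAG flow follows; the corpus is made up, and the word-overlap `score` function is a stand-in for cosine similarity between real embeddings (a production system would use an embedding model and a vector store):

```python
import re

# Toy corpus; a real system would embed these documents and index the vectors.
docs = [
    "Our refund policy: purchases can be refunded within 30 days.",
    "Standard shipping takes 3 to 5 business days.",
    "The warranty covers manufacturing defects for one year.",
]

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def score(query: str, doc: str) -> float:
    # Stand-in for embedding cosine similarity: fraction of query words in doc.
    q = tokens(query)
    return len(q & tokens(doc)) / len(q)

def retrieve(query: str, k: int = 2) -> list[str]:
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]

# Only the top-k chunks go into the prompt, not the whole corpus.
question = "what is the refund policy?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

The key property is that the prompt stays small and relevant regardless of how large the corpus grows.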
1M token context window:
  Cost per query: ~$5 (if full, at $5/M input tokens)
  Latency: 10-30 seconds
  Recall @ middle: degrades

RAG (top-5 chunks, ~2,000 tokens):
  Cost per query: ~$0.02
  Latency: 200 ms retrieval + 1-2 s generation
  Recall: high (relevant chunks at top)

Q4: A model scores 95% accuracy on your test set but fails in production. Why?
Answer — this is a classic ML gotcha:
Train-test distribution shift — Test set doesn't represent production data. Model memorized test patterns.
Data leakage — Test data accidentally leaked into training (e.g., chronological split not respected, or test set derived from same source as train).
Label quality — If test labels were produced by the same process that produced training labels (e.g., another model), accuracy measures how well you mimic that process, not ground truth.
Metric mismatch — Accuracy is a bad metric for imbalanced classes. 95% accuracy on 95/5 class split = model just predicts majority class.
Overfitting — Model memorized training data patterns that don't generalize.
Production edge cases — Real users query in ways your test set never captured (typos, unusual phrasing, adversarial inputs).
Mitigation:
- Use proper train/val/test splits with temporal awareness
- Monitor production distribution (data drift)
- Use LLM-as-judge on a random sample of production outputs
- A/B test before full rollout
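For the drift-monitoring point, a minimal sketch using the Population Stability Index on a single feature such as query length (the data, feature choice, and alarm threshold are illustrative):

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index: values above ~0.2 are commonly
    read as significant distribution drift."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_counts, _ = np.histogram(expected, bins=edges)
    a_counts, _ = np.histogram(actual, bins=edges)
    e = np.clip(e_counts / e_counts.sum(), 1e-6, None)
    a = np.clip(a_counts / a_counts.sum(), 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
test_lengths = rng.normal(50, 10, 5_000)  # query lengths in your test set
prod_lengths = rng.normal(70, 15, 5_000)  # production queries drifted longer

print(f"PSI: {psi(test_lengths, prod_lengths):.2f}")  # far above 0.2 here
```

Tracking a handful of such features per day is often enough to catch the distribution shift before accuracy metrics visibly degrade.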
Q5: What's the difference between top_p=0.1 and temperature=0.1? Which should you use?
Answer:
Both reduce randomness but work differently:
Temperature scales the logit distribution before sampling:
low temp → peaks sharpen → top tokens dominate
Top-p (nucleus sampling) samples from smallest set of tokens
that together have probability ≥ p:
top_p=0.1 → sample only from top 10% of probability mass
Temperature changes the shape of the distribution.
Top-p truncates the tail of the distribution.
Practical difference:
─────────────────────────────────────────────────────
Scenario: token distribution is flat (model uncertain)
temp=0.1 → still picks from flat, but with less randomness
top_p=0.1 → only considers top 10% — highly restrictive
Scenario: one token dominates (model confident)
temp=0.1 → picks that token
top_p=0.1 → effectively same result
─────────────────────────────────────────────────────
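The two scenarios above can be reproduced on a toy distribution (the logit values are made up purely for illustration):

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def apply_temperature(logits: np.ndarray, temp: float) -> np.ndarray:
    # Temperature rescales logits, changing the *shape* of the distribution.
    return softmax(logits / temp)

def apply_top_p(probs: np.ndarray, p: float) -> np.ndarray:
    # Nucleus sampling keeps the smallest set of tokens whose cumulative
    # probability reaches p, truncating the *tail* of the distribution.
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
    kept = np.zeros_like(probs)
    kept[order[:cutoff]] = probs[order[:cutoff]]
    return kept / kept.sum()

flat = np.array([2.0, 1.9, 1.8, 1.7])   # model uncertain
sharp = np.array([5.0, 1.0, 0.5, 0.2])  # model confident

# Flat case: temp=0.1 sharpens but keeps every token in play;
# top_p=0.1 collapses the choice to a single token.
print(apply_temperature(flat, 0.1).round(2))
print(apply_top_p(softmax(flat), 0.1).round(2))
```

On the sharp distribution both settings end up picking the same dominant token, which is why the difference only matters when the model is uncertain.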
Recommendation:
Use temperature alone for most tasks.
Use top_p=0.9 with temperature=1 for creative tasks (OpenAI's recommended approach).
Do NOT use both temperature reduction and top_p reduction simultaneously — it double-constrains the output.

Q6: Why do LLMs fail at arithmetic but pass math exams?
Answer:
LLMs fail at pure arithmetic because:
- They tokenize numbers non-intuitively ("1234" may split into two or three tokens rather than one per digit)
- They learned math reasoning from text, not from a calculator
- No symbolic manipulation — just pattern matching on digit strings
But pass math exams because:
- Exams test conceptual understanding + problem setup, not just arithmetic
- Chain-of-thought prompting lets them reason through steps
- They can recall formulas, theorems, and approaches from training data
- Final arithmetic errors sometimes don't affect the exam score
In production: Use tool calling (code interpreter) for any actual arithmetic:
```python
# Don't ask directly: "What is 17 × 23 × 41?"
# The LLM might say 16081 (wrong).

# Do: have the LLM call a calculator tool.
def eval_math_expression(expr: str) -> float:
    # Illustrative only — a real implementation should parse the
    # expression rather than use eval.
    return eval(expr, {"__builtins__": {}}, {})

tools = [{"name": "calculate", "function": eval_math_expression}]
# LLM outputs a tool call: calculate("17 * 23 * 41")
# Tool returns: 16031 (correct)
```

Q7: Can you explain why "bigger model is always better" is wrong?
Answer:
Common counterexamples:
Task simplicity — GPT-4o for spam detection is overkill. A fine-tuned small model (BERT, DistilBERT) is faster, cheaper, and can be more accurate for that specific task.
Latency constraints — Real-time applications (autocomplete, voice) need <200ms. Smaller quantized models can run locally, while larger models require API round-trips.
Cost at scale — At 1M requests/day, GPT-4o ($5/M tokens input) vs GPT-3.5 ($0.50/M tokens) = 10x cost difference. A fine-tuned small model can be hosted for fixed cost.
Distillation can match — GPT-4-generated synthetic data used to fine-tune smaller models can approach GPT-4 quality on specific domains.
Diminishing returns — On many benchmarks, the quality gap between GPT-3.5 and GPT-4 is small relative to the 10x cost gap.
The right question: What's the smallest model that meets quality requirements at acceptable cost/latency for this specific task?
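The cost-at-scale comparison is back-of-envelope arithmetic. A sketch using the per-token prices quoted above (request volume and tokens per request are illustrative assumptions):

```python
def monthly_input_cost(requests_per_day: int, tokens_per_request: int,
                       price_per_million_tokens: float) -> float:
    # Input-token cost only; output tokens would add to both sides.
    tokens_per_month = requests_per_day * 30 * tokens_per_request
    return tokens_per_month / 1_000_000 * price_per_million_tokens

# Assume 1M requests/day at ~1,000 input tokens each
gpt4o = monthly_input_cost(1_000_000, 1_000, 5.00)   # $150,000/month
gpt35 = monthly_input_cost(1_000_000, 1_000, 0.50)   # $15,000/month
print(f"GPT-4o: ${gpt4o:,.0f}/mo vs GPT-3.5: ${gpt35:,.0f}/mo")
```

At this volume the model choice is a six-figure monthly decision, which is exactly why the "smallest model that meets requirements" framing matters.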