AI Fundamentals — Interview Questions
Common Questions
Q1: What is the difference between AI, Machine Learning, and Deep Learning?
Answer:
AI ⊃ ML ⊃ Deep Learning
AI — any technique that enables machines to mimic human intelligence
(rule-based systems count as AI too)
ML — a subset of AI where systems learn from data without being explicitly programmed
(algorithm finds patterns in data to make predictions)
Deep Learning — a subset of ML using multi-layer neural networks
(automatically extracts hierarchical features)
Example:
Spam filter using hand-coded rules → AI (not ML)
Spam filter using logistic regression on email features → ML
Spam filter using BERT to understand email content → Deep Learning

Q2: What is a transformer and why did it replace RNNs for language tasks?
Answer:
RNNs had two key problems:
- Sequential processing — token N can't be processed until token N-1 is done → slow
- Vanishing gradients — gradients shrink as they flow back through long sequences → forgets early context
Transformers solve both:
- Parallel processing — all tokens processed simultaneously (attention computed in parallel)
- Self-attention — every token directly attends to every other token → no vanishing gradient over distance
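The self-attention step from the second bullet can be sketched in plain Python, using a tiny two-token example (all values are illustrative, not from any real model):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    Q, K, V are lists of row vectors, one per token."""
    d_k = len(K[0])
    out = []
    for q in Q:  # each row is independent, which is why this parallelizes
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)  # attention weights sum to 1
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Two tokens, d_k = 2: every token attends to every other token directly
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(attention(Q, K, V))
```

Each output row mixes all value vectors, weighted by how strongly that token attends to each position; no information has to flow step-by-step through a recurrent state.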
```python
# RNN (sequential — can't parallelize)
h_t = tanh(W_h * h_prev + W_x * x_t)  # must wait for h_{t-1}

# Attention (parallel — all tokens at once)
# Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V
# Q, K, V computed for all tokens simultaneously
```

Q3: What is hallucination and how do you mitigate it?
Answer:
Hallucination is when an LLM generates fluent-sounding but factually incorrect information. It happens because models are trained to produce probable next tokens, not verified facts.
Mitigation strategies:
| Strategy | How it works |
|---|---|
| RAG | Ground answers in retrieved documents |
| Lower temperature | More deterministic outputs |
| Chain-of-thought | Prompt model to reason step-by-step |
| Self-consistency | Generate multiple answers, take majority |
| Fact-checking layer | Second LLM call to verify claims |
| Structured output | JSON schema forces specific format |
| Prompt: "Say I don't know" | Explicitly instruct model to express uncertainty |
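Self-consistency from the table can be sketched as a majority vote over repeatedly sampled answers (the `sample_answers` list stands in for several LLM calls at temperature > 0):

```python
from collections import Counter

def self_consistency(answers):
    """Return the most common answer among multiple sampled generations."""
    counts = Counter(a.strip().lower() for a in answers)
    best, _ = counts.most_common(1)[0]
    return best

# Hypothetical: five sampled answers to the same factual question
sample_answers = ["Paris", "Paris", "Lyon", "Paris", "paris"]
print(self_consistency(sample_answers))  # → "paris" (majority wins)
```

The intuition: a hallucination is usually unstable across samples, while the correct answer tends to recur.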
```python
# Example: RAG to reduce hallucination
context = retrieve_relevant_docs(query)
prompt = f"""
Answer ONLY based on the following context.
If the answer is not in the context, say "I don't know."

Context: {context}
Question: {query}
"""
```

Q4: Explain the difference between fine-tuning and RAG. When do you use each?
Answer:
| Fine-tuning | RAG |
|---|---|
| Updates model weights | No weight updates |
| Learns style/behavior/format | Injects facts at query time |
| Static knowledge (point in time) | Dynamic — update docs anytime |
| Expensive to update | Cheap to update |
| Good for: tone, format, domain vocabulary | Good for: factual Q&A, up-to-date info, private knowledge bases |

Example use cases:
- Fine-tune: customer service tone, code style, legal document format
- RAG: "What's in our docs?", legal precedent lookup, product FAQ, company policies

Rule of thumb: if you need the model to know something → RAG. If you need the model to behave differently → fine-tune.
Q5: What is the context window and why does it matter in production?
Answer:
The context window is the maximum number of tokens an LLM can process in one request (input + output combined).
| Model | Context window |
|---|---|
| GPT-3.5 | 16k tokens |
| GPT-4o | 128k tokens |
| Claude 3.5 | 200k tokens |
| Gemini 1.5 Pro | 1M tokens |

1 page of text ≈ 750 tokens; a full novel (300 pages) ≈ 225k tokens.

Why it matters:
- Cost — every token costs money (both input and output)
- Latency — more tokens = slower response
- "Lost in the middle" — LLMs recall beginning and end of context better than middle
- Chunking strategy — determines how you split docs for RAG
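The chunking point can be sketched with a naive fixed-size splitter with overlap (word-based here for simplicity; production pipelines usually count tokens, not words):

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping word-based chunks for RAG indexing."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap  # advance by chunk_size minus the overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last window already covers the tail
    return chunks

doc = "word " * 500  # a 500-word stand-in document
chunks = chunk_text(doc, chunk_size=200, overlap=50)
print(len(chunks))  # overlapping 200-word windows, stepping 150 words
```

The overlap exists so that a fact straddling a chunk boundary still appears whole in at least one chunk.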
Q6: What is tokenization and why do some words cost more tokens than others?
Answer:
Tokenization splits text into subword units using algorithms like BPE (Byte Pair Encoding) or SentencePiece.
```
# English - efficient
"Hello"   → 1 token
"running" → 1 token

# Non-English - less efficient (less training data → smaller subword chunks)
"こんにちは" (Japanese: hello) → 3 tokens
"مرحبا" (Arabic: hello) → 4 tokens

# Rare words - split into subwords
"supercalifragilistic" → 6 tokens
"GPT-4o" → 4 tokens

# Numbers - often split into small, unpredictable chunks
"12345" → 3 tokens
"1, 2, 3, 4, 5" → 9 tokens
```

Production impact: APIs price by token. A prompt with many numbers, special characters, or non-English text costs more than expected.
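A single BPE merge step can be sketched on a toy corpus (real tokenizers learn tens of thousands of merges from huge corpora; this only shows the core loop):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words; return the most common."""
    pairs = Counter()
    for word in words:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the pair with one merged symbol."""
    merged = []
    for word in words:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(out)
    return merged

# Each word starts as a list of characters
words = [list("hug"), list("pug"), list("hugs")]
pair = most_frequent_pair(words)   # ("u", "g") appears 3 times
words = merge_pair(words, pair)
print(pair, words)                 # "ug" is now a single symbol
```

Frequent character sequences get merged into single tokens, which is exactly why common English words cost 1 token while rare or non-English strings stay split into many small pieces.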
Q7: What is temperature and when do you change it?
Answer:
Temperature controls the randomness of token sampling.
Temperature 0.0 → always pick highest-probability token (deterministic)
Temperature 0.7 → balanced (default for most tasks)
Temperature 1.0 → sample proportionally from distribution
Temperature 2.0 → very random / creative / incoherent
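The scale above can be sketched by dividing the logits by T before the softmax (toy logits, not from a real model):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/T, then softmax. Lower T sharpens the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                        # toy next-token logits
cold = softmax_with_temperature(logits, 0.2)    # near-deterministic
warm = softmax_with_temperature(logits, 1.5)    # flatter, more random
print([round(p, 3) for p in cold])
print([round(p, 3) for p in warm])
```

At low T the top token takes almost all the probability mass; at high T the mass spreads out, which is where creativity (and incoherence) comes from.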
Use cases:

| Task | Recommended temp |
|---|---|
| Code generation | 0.0 - 0.2 |
| Factual Q&A / RAG | 0.0 - 0.3 |
| Summarization | 0.3 - 0.5 |
| General chat | 0.7 |
| Creative writing | 0.9 - 1.2 |
| Brainstorming | 1.0 - 1.5 |

Q8: What is the difference between embeddings and one-hot encoding?
Answer:
One-hot encoding:
"cat" → [0, 0, 1, 0, 0, 0, 0, ...] (50000 zeros, one 1)
"dog" → [0, 1, 0, 0, 0, 0, 0, ...]
Problems:
- No similarity — "cat" and "dog" are equally "different"
- High-dimensional — one dimension per vocabulary word
Embeddings:
"cat" → [0.2, -0.4, 0.8, 0.1, ...] (dense, 768-3072 dimensions)
"dog" → [0.3, -0.3, 0.7, 0.2, ...] (similar vector!)
"car" → [-0.5, 0.9, -0.2, 0.8, ...] (different direction)
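Using the (truncated, illustrative) vectors above, cosine similarity makes the "cat is closer to dog than to car" claim concrete:

```python
import math

def cosine_similarity(a, b):
    """cos(θ) = (a · b) / (|a| |b|)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

cat = [0.2, -0.4, 0.8, 0.1]
dog = [0.3, -0.3, 0.7, 0.2]
car = [-0.5, 0.9, -0.2, 0.8]

print(round(cosine_similarity(cat, dog), 3))  # high: similar direction
print(round(cosine_similarity(cat, car), 3))  # negative: unrelated direction
```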
Benefits:
- Capture semantic similarity (cosine similarity)
- Compress vocabulary into manageable dimensions
- Learnable — model learns what dimensions mean
- Transferable — pre-trained embeddings work across tasks

Q9: What is RLHF and how does it make models like ChatGPT?
Answer:
RLHF (Reinforcement Learning from Human Feedback) is a 3-stage process:
Stage 1: Supervised Fine-tuning (SFT)
- Human labelers write ideal responses to prompts
- Base model fine-tuned on these (prompt, ideal response) pairs
- Model learns the desired format/style
Stage 2: Reward Model Training
- Given prompt + multiple responses, human ranks them
- Reward model learns to predict human preference score
- e.g. "Response A is better than B" → reward model learns why
Stage 3: PPO (Proximal Policy Optimization)
- Use RL to maximize reward model score
- Model generates responses → reward model scores them
- Model updates toward higher-scoring behavior
- KL divergence penalty: don't drift too far from the SFT model

Without RLHF, base models complete text probabilistically and may produce harmful, biased, or unhelpful output. RLHF aligns the model to be helpful, harmless, and honest.
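Stage 3's KL-penalized objective can be sketched like this: the reward the policy actually optimizes is the reward-model score minus a penalty for drifting from the SFT model (toy per-token log-probs; `beta` is a hyperparameter, and the simple log-prob difference is only one common KL estimate):

```python
def kl_penalized_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """Per-sequence RLHF reward: reward-model score minus a beta-weighted
    KL estimate (sum of per-token log-prob differences, policy vs. SFT ref)."""
    kl_estimate = sum(p - r for p, r in zip(logp_policy, logp_ref))
    return rm_score - beta * kl_estimate

# Toy numbers: the policy assigns higher log-probs than the SFT reference,
# i.e. it has drifted, so the penalty pulls the effective reward down
logp_policy = [-0.5, -0.8, -0.3]
logp_ref = [-1.0, -1.2, -0.9]
print(kl_penalized_reward(rm_score=2.0, logp_policy=logp_policy, logp_ref=logp_ref))
```

This is the mechanism behind "maximize reward model score" and "don't drift too far from the SFT model" living in one objective.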