6 min read
AI Fundamentals — Deep Dive
How Transformers Actually Work
The attention mechanism (the core of every modern LLM):
For each token in the input:
1. Create Q (query), K (key), V (value) vectors
2. Compute attention scores: Q × K^T / √d_k
3. Softmax → attention weights (sum to 1)
4. Weighted sum of V vectors = attended output
Intuition:
Q = "what am I looking for?"
K = "what information do I have?"
V = "what information do I give?"
"The cat sat because it was tired"
When processing "it" → high attention to "cat"
Q("it") has high similarity with K("cat")
Multi-head attention:
Run attention multiple times in parallel with different learned projections
Each "head" can learn different types of relationships:
Head 1: syntactic (subject-verb)
Head 2: semantic (synonyms)
Head 3: positional (nearby tokens)
Concatenate all heads → linear transform
Why this matters in interviews:
"Transformers can attend to all positions simultaneously" (vs RNNs, sequential)
"Attention is O(n²) in context length" (why long context is expensive)
"KV cache stores K,V for previous tokens, so each new token is fast"Tokenization in Depth
python# Tokens ≠ words. Understanding tokenization prevents real bugs.
import tiktoken
enc = tiktoken.get_encoding("cl100k_base") # GPT-4 tokenizer
# Words → tokens (not 1:1)
examples = {
"hello": 1, # Common word = 1 token
"tokenization": 3, # ["token", "ization"] = 3 tokens
"supercalifragilistic": 5, # Rare = many tokens
"2024-01-15": 5, # Dates are expensive
" hello": 1, # Space is part of the token (GPT uses byte-pair encoding)
"hello ": 2, # Trailing space = separate token
}
# Token counting — critical for cost estimation
def count_tokens(text: str, model: str = "gpt-4") -> int:
encoding = tiktoken.encoding_for_model(model)
return len(encoding.encode(text))
# Code is tokenized differently than prose
code_snippet = "def add(a, b):\n return a + b"
# → ~10 tokens (common keywords = fewer tokens)
# Non-English is MORE expensive
chinese_text = "你好世界" # → 5 tokens (vs 2 tokens for "hello world")
arabic_text = "مرحبا" # → 6 tokens
# Practical implications:
# 1. Cost: 1M tokens ≠ 1M words (divide word count by ~0.75 for rough token estimate)
# 2. Context limits: "128k context" ≠ "128k words"
# 3. Some operations inflate token count (JSON has lots of quotes, brackets)
# Reduce token count:
# ✓ Remove whitespace/formatting in prompts where not needed
# ✓ Use abbreviations in system prompts
# ✓ Prefer "Respond in JSON" over repeating schema twiceContext Windows — What They Really Mean
Context window = total tokens the model "sees" at once
= input tokens + output tokens
Model Context Window Effective Use
─────────────────────────────────────────────────────
GPT-4o 128k ~100k pages of text
Claude 3.5/4 200k ~150k pages of text
Gemini 1.5 Pro 1M ~750k pages of text
GPT-4o-mini 128k Same as GPT-4o
Why context limits matter:
1. Cost: you pay for EVERY token in context on EVERY request
Chat with 10 turns of 500 tokens each + system prompt = 5500 input tokens per message
At 100k messages/day: 550M tokens/day = ~$1,375/day at $2.50/M
2. The "lost in the middle" problem
LLMs pay less attention to information in the middle of context
Put critical info at start AND end, not buried in middle
3. Context ≠ memory
LLM has no persistent memory between separate API calls
Every call starts fresh — you must re-inject conversation history
4. KV cache (key-value cache)
After processing tokens, their K and V matrices are cached
Subsequent tokens are fast (O(1) per new token, not O(n²))
Prompt caching exploits this: static content cached = cheaper
Strategies for long contexts:
Short context (<8k): Send everything
Medium (8k-32k): Summarize older history
Long (>32k): RAG — retrieve relevant sections only
Very long (>128k): Map-reduce (process in chunks, combine summaries)Temperature, Top-p, and Sampling
python# These control randomness in generation
# Temperature:
# temperature = 0: deterministic (always pick highest probability token)
# temperature = 0.7: balanced — typical for chat
# temperature = 1.0: default distribution
# temperature = 2.0: very random, often incoherent
# Intuition: divides logits before softmax
# Low temp → probabilities become more peaked (winner takes more)
# High temp → probabilities flatten (more randomness)
USE_CASES = {
"factual Q&A": {"temperature": 0},
"code generation": {"temperature": 0.1},
"customer support": {"temperature": 0.3},
"creative writing": {"temperature": 0.8},
"brainstorming": {"temperature": 1.0},
}
# Top-p (nucleus sampling):
# Only sample from tokens whose cumulative probability ≥ p
# top_p=0.9: only consider tokens in the top 90% probability mass
# More adaptive than temperature (dynamically adjusts vocab size)
# In practice:
# Change temperature OR top_p, not both
# temperature=0 overrides top_p (deterministic)
# Default: temperature=1, top_p=1 → normal sampling
# Max tokens (max_completion_tokens):
# Hard cap on output length
# Set for cost control AND to prevent runaway responses
# For JSON extraction: set low (500-1000)
# For long-form generation: set appropriately (2000-4000)
# For reasoning (Claude extended thinking): set high (16000+)
from anthropic import Anthropic
client = Anthropic()
response = client.messages.create(
model="claude-opus-4-6",
max_tokens=1024,
temperature=0, # Deterministic for factual extraction
messages=[{"role": "user", "content": "What is 2+2?"}]
)Model Architecture Trade-offs
Scaling laws (Chinchilla, 2022):
Optimal training: use (model_params × 20) tokens of training data
GPT-3 (175B params) was undertrained — should have used 3.5T tokens
This motivated training larger datasets, not just larger models
Model size vs capability:
7B models: fast, cheap, good for simple tasks (classification, extraction)
13B models: better reasoning, still fits on consumer GPU
70B models: near GPT-4 quality on many tasks, needs multi-GPU
>100B: state of the art, massive infrastructure
Mixture of Experts (MoE) — how GPT-4 likely works:
Instead of one dense network, use many smaller "expert" networks
Router decides which experts to activate per token
GPT-4 rumored: 8 experts of 220B params each → 1.8T total
But only 2 experts activate per token → actual compute = ~440B
Benefits:
More parameters (capacity) for same compute
Can specialize experts for different types of content
Drawbacks:
Communication overhead in distributed training
Load balancing (some experts overloaded, others idle)
Quantization:
float32: 4 bytes per param (32 bits)
float16/bfloat16: 2 bytes per param → 2x memory savings
int8: 1 byte per param → 4x memory, slight quality loss
int4 (GPTQ/AWQ): 0.5 bytes → 8x memory, visible quality loss
NF4 (QLoRA): 0.5 bytes, better for activations
7B model memory requirements:
float32: 28 GB
float16: 14 GB (standard deployment)
int8: 7 GB
int4: 3.5 GB (fine-tunable on consumer GPU)Embedding Models — What You Need to Know
python# Embeddings convert text → dense numerical vectors
# Semantically similar text → geometrically close vectors
# Comparing text via cosine similarity:
import numpy as np
def cosine_similarity(a: list[float], b: list[float]) -> float:
a, b = np.array(a), np.array(b)
return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# Why cosine similarity (not euclidean distance)?
# Cosine measures angle → direction of meaning
# Euclidean measures magnitude → affected by text length
# "cat" and "cat cat cat" → same direction, different magnitude
# Cosine: 1.0 (identical meaning) Euclidean: large distance
# Popular embedding models:
MODELS = {
"text-embedding-3-small": {
"dims": 1536, # Or truncate to 256/512
"tokens_limit": 8191,
"cost": "$0.02/M tokens", # Very cheap
"use_for": "General purpose, RAG"
},
"text-embedding-3-large": {
"dims": 3072,
"tokens_limit": 8191,
"cost": "$0.13/M tokens",
"use_for": "Higher accuracy for complex domains"
},
"nomic-embed-text": {
"dims": 768,
"tokens_limit": 8192,
"cost": "Free (local)",
"use_for": "Local/private deployment"
},
"BAAI/bge-m3": {
"dims": 1024,
"tokens_limit": 8192,
"cost": "Free (local)",
"use_for": "Multilingual, best open-source"
},
}
# Matryoshka embeddings (text-embedding-3 models):
# Can truncate dimensions without re-embedding!
# 1536 → 256 dimensions: 6x smaller, 90% of the quality
# Use for: high-volume scenarios where storage cost matters
from openai import OpenAI
client = OpenAI()
response = client.embeddings.create(
model="text-embedding-3-small",
input="The quick brown fox",
dimensions=256, # Truncate to 256 dims (Matryoshka)
)
embedding = response.data[0].embedding # List of 256 floatsKey Numbers to Know in Interviews
Model Pricing (approximate, as of 2025):
GPT-4o: $2.50/M input, $10.00/M output
GPT-4o-mini: $0.15/M input, $0.60/M output
Claude Opus 4.6: $15.00/M input, $75.00/M output
Claude Sonnet 4.6: $3.00/M input, $15.00/M output
Claude Haiku 4.5: $0.80/M input, $4.00/M output
Groq Llama-3-70b: $0.59/M input, $0.79/M output (fastest)
Latency (typical, non-streaming):
GPT-4o-mini: 500ms-2s
GPT-4o: 1-4s
Claude Sonnet: 1-3s
Claude Opus: 2-8s
Groq Llama-3: 200-800ms (10x faster inference)
Context windows:
GPT-4o: 128k tokens
Claude 3.5+: 200k tokens
Gemini 1.5 Pro: 1M tokens
Approximate token-to-word ratio: 1 token ≈ 0.75 words
Embedding dimensions:
OpenAI small: 1536 (or 256-512 truncated)
OpenAI large: 3072
Most open source: 384-1024
RAG performance benchmarks (good system):
Faithfulness: > 0.85
Answer relevancy: > 0.80
Context precision: > 0.75
Retrieval latency (pgvector + HNSW): < 50ms for 1M docs[prev·next]