Generative AI
What is Generative AI?
Generative AI is a class of ML models that produce new content (text, images, audio, video, code) by learning the statistical patterns of their training data.
Discriminative AI (classic ML):
Input → Classify / Predict a label
e.g. "Is this email spam?" → Yes / No
Generative AI:
Input (prompt/noise) → Generate new data
e.g. "Write a product description" → novel text output
The key insight: generative models learn the probability distribution of training data, then sample from it.
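To make "learn the distribution, then sample" concrete, here is a minimal sketch using a character-level bigram model, which is far simpler than any real generative model. The toy corpus and the function names `fit_bigram_model` and `sample_text` are illustrative assumptions, not part of any library.

```python
import random
from collections import Counter, defaultdict

def fit_bigram_model(corpus: str) -> dict[str, Counter]:
    """'Training': count how often each character follows each other character."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1
    return counts

def sample_text(model, start: str, length: int, rng: random.Random) -> str:
    """'Generation': repeatedly sample the next character from the learned counts."""
    out = [start]
    for _ in range(length - 1):
        followers = model.get(out[-1])
        if not followers:  # this character was never followed by anything
            break
        chars = list(followers.keys())
        weights = list(followers.values())
        out.append(rng.choices(chars, weights=weights, k=1)[0])
    return "".join(out)

# Toy corpus, chosen only for illustration.
corpus = "the cat sat on the mat. the rat sat on the cat."
model = fit_bigram_model(corpus)
print(sample_text(model, "t", 30, random.Random(0)))
```

The output is new text that never appears verbatim in the corpus but follows its character statistics. LLMs apply the same idea with a learned neural network in place of the count table.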
Types of Generative Models
Generative Models
├── Language Models (LLMs)
│   ├── Generate text token-by-token
│   ├── Examples: GPT-4, Claude, Gemini, Llama 3
│   └── Architecture: Transformer (decoder-only)
│
├── Diffusion Models
│   ├── Generate images/audio by iteratively denoising
│   ├── Examples: Stable Diffusion, DALL-E 3, Sora (video)
│   └── Process: add noise → learn to reverse it
│
├── GANs (Generative Adversarial Networks)
│   ├── Generator vs Discriminator in competition
│   ├── Examples: StyleGAN, early deepfakes
│   └── Status: largely replaced by diffusion for images
│
├── VAEs (Variational Autoencoders)
│   ├── Encode to latent space, decode to output
│   └── Used in image compression, style transfer
│
└── Flow Models
    ├── Invertible transformations on data
    └── Used in audio (WaveGlow), density estimation
The Transformer: Foundation of Modern GenAI
The transformer architecture (2017, "Attention Is All You Need") powers virtually every modern LLM.
Input Tokens
                   │
                   ▼
┌─────────────────────────────────────┐
│ Embedding Layer                     │
│ token IDs → dense vectors (d_model) │
└─────────────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│ Positional Encoding                 │
│ adds position info (sin/cos or RoPE)│
└─────────────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│ N × Transformer Blocks              │
│   ┌─────────────────────────────┐   │
│   │ Multi-Head Self-Attention   │   │
│   │ + Residual + LayerNorm      │   │
│   ├─────────────────────────────┤   │
│   │ Feed-Forward Network (FFN)  │   │
│   │ + Residual + LayerNorm      │   │
│   └─────────────────────────────┘   │
└─────────────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│ Output Head (LM head)               │
│ hidden state → vocab logits → token │
└─────────────────────────────────────┘
Self-Attention in Plain English
For each token, attention asks: "which other tokens in this sequence should I pay attention to?"
Query (Q): "What am I looking for?"
Key (K): "What do I contain?"
Value (V): "What do I pass forward if matched?"
Attention(Q, K, V) = softmax(QK^T / √d_k) × V
Multi-head attention runs this in parallel across H heads; each head learns different types of relationships (syntax, coreference, long-range dependencies).
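The attention formula above can be sketched for a single head in plain Python. This is a minimal illustration, not an optimized implementation: the toy Q, K, V values are made up, the function names are assumptions, and the causal mask that decoder-only LLMs apply is omitted for brevity.

```python
import math

def softmax(xs: list[float]) -> list[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head.

    Q, K: n vectors of size d_k. V: n vectors of size d_v.
    Returns (outputs, weights); weights[i][j] is how much token i attends to token j.
    """
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(w[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

# Toy example: 3 tokens, d_k = d_v = 2 (values chosen arbitrarily).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outputs, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, so every output vector is a weighted average of the value vectors. Multi-head attention would run this with H separate sets of learned Q/K/V projections and concatenate the results.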
LLM Training Pipeline
Stage 1: Pre-training
Data: trillions of tokens (web, books, code)
Task: next-token prediction (self-supervised)
Cost: millions of dollars, weeks of compute
Output: a base model (knows language, not instructions)
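The next-token prediction task in Stage 1 is usually trained with a cross-entropy loss. Here is a minimal sketch that computes it from raw logits; the function name `next_token_loss`, the 4-token vocabulary, and the logits are all made up for illustration.

```python
import math

def next_token_loss(logits: list[list[float]], token_ids: list[int]) -> float:
    """Mean cross-entropy of predicting token t+1 from position t.

    logits[t] holds one score per vocabulary entry, produced after the model
    has read token_ids[: t + 1]. The target for position t is token_ids[t + 1].
    """
    total = 0.0
    n_targets = len(token_ids) - 1
    for t in range(n_targets):
        row = logits[t]
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))  # log-sum-exp
        total += log_z - row[token_ids[t + 1]]  # equals -log p(target)
    return total / n_targets

# Toy vocabulary of 4 tokens; logits are invented for illustration.
token_ids = [0, 2, 1, 3]
uniform = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
confident = [[0.0, 0.0, 5.0, 0.0], [0.0, 5.0, 0.0, 0.0], [0.0, 0.0, 0.0, 5.0]]
print(next_token_loss(uniform, token_ids))    # ln(4): a uniform guess over 4 tokens
print(next_token_loss(confident, token_ids))  # lower: the right tokens get high scores
```

No human labels are needed because the targets are just the input sequence shifted by one position, which is why the task is called self-supervised.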
Stage 2: Supervised Fine-Tuning (SFT)
Data: curated (prompt, ideal response) pairs
Task: train model to follow instructions
Output: instruction-following model
Stage 3: RLHF (Reinforcement Learning from Human Feedback)
Data: human preference rankings of model outputs
Reward model: learns to score outputs like a human would
RL loop: optimize model against reward model (PPO)
Output: aligned, helpful, less harmful model
Alternative to RLHF: DPO (Direct Preference Optimization)
No reward model needed — directly optimizes on preference pairs
More stable, cheaper, increasingly common (Llama 3, Mistral)
Key Concepts and Parameters
Tokens
LLMs don't process characters — they process tokens (subword units).
"Hello, world!" → ["Hello", ",", " world", "!"] = 4 tokens
"unbelievable" → ["un", "believ", "able"] = 3 tokens
Rules of thumb:
1 token ≈ 4 characters (English)
1 token ≈ 0.75 words
100 tokens ≈ 75 words
Tokenization matters because:
- Context windows are measured in tokens, not words
- Non-English text is often tokenized less efficiently
- Pricing is per-token
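The rules of thumb above are enough for a rough budget estimate before calling a model. The sketch below averages the character-based and word-based heuristics. The function names and the price used in the example are hypothetical, and exact counts require the model's actual tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Rough token count for English text, averaging two heuristics."""
    by_chars = len(text) / 4             # 1 token ≈ 4 characters
    by_words = len(text.split()) / 0.75  # 1 token ≈ 0.75 words
    return max(1, round((by_chars + by_words) / 2))

def estimate_cost(n_tokens: int, usd_per_million_tokens: float) -> float:
    """Cost in USD for n_tokens at a given per-million-token price."""
    return n_tokens * usd_per_million_tokens / 1_000_000

document = "word " * 75  # 75 short English words
tokens = estimate_tokens(document)
# 3.0 USD per million tokens is a placeholder, not a real price.
print(tokens, estimate_cost(tokens, usd_per_million_tokens=3.0))
```

Treat the result as an estimate only: code, non-English text, and unusual formatting often tokenize far less efficiently than these ratios suggest.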
Context Window
The maximum number of tokens the model can process at once (input + output combined).
GPT-3.5: 4K → 16K tokens
GPT-4: 8K → 128K tokens
Claude 3: 200K tokens
Gemini 1.5: 1M tokens
Longer context ≠ better attention. Models often struggle with information in the middle of very long contexts (lost-in-the-middle problem).
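Because input and output share the same window, applications typically reserve room for the response and trim the prompt to fit. Here is a minimal sketch of that bookkeeping, assuming per-message token counts are already known. The function names and all numbers are hypothetical.

```python
def fits_context(prompt_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """True if the prompt plus the reserved output budget fits in the window."""
    return prompt_tokens + max_output_tokens <= context_window

def trim_oldest(message_tokens: list[int], max_output_tokens: int, context_window: int) -> list[int]:
    """Drop the oldest messages until the rest fits; returns the kept token counts."""
    kept = list(message_tokens)
    while kept and not fits_context(sum(kept), max_output_tokens, context_window):
        kept.pop(0)  # the oldest message is first in the list
    return kept

# Hypothetical chat history, oldest message first.
history = [3000, 2500, 1800, 900]
print(trim_oldest(history, max_output_tokens=1000, context_window=8000))
```

Dropping the oldest messages is only one possible policy. Summarizing older turns or retrieving only the relevant passages are common alternatives.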
Temperature
Controls randomness in output sampling.
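Mechanically, temperature divides the logits before the softmax. The sketch below shows how that sharpens or flattens the resulting distribution. The function name and the logits are made up for illustration, and treating a temperature of 0 as greedy decoding is a common convention assumed here.

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        # Treat 0 as greedy decoding: all probability on the top-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.0, 0.7, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Lower temperatures concentrate probability on the top token, while higher temperatures spread it toward the unlikely ones.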
temperature = 0.0 → greedy decoding: always picks the highest-probability token (near-deterministic in practice)
temperature = 0.7 → balanced, good for most tasks
temperature = 1.0 → samples from the model's unmodified distribution
temperature = 2.0 → very random, often incoherent
Use low temp for: factual Q&A, code, structured output
Use high temp for: creative writing, brainstorming
Top-P (Nucleus Sampling)
Sample only from the smallest set of most likely tokens whose cumulative probability reaches P (a fraction between 0 and 1).
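Here is a minimal sketch of nucleus filtering, assuming the token probabilities are already computed. The function names and the example distribution are illustrative, and real samplers operate on full vocabulary tensors, not short lists.

```python
import random

def top_p_filter(probs: list[float], top_p: float) -> dict[int, float]:
    """Keep the smallest set of top tokens whose cumulative probability
    reaches top_p, then renormalize. Returns {token_index: probability}."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return {i: probs[i] / cumulative for i in kept}

def sample_top_p(probs: list[float], top_p: float, rng: random.Random) -> int:
    """Draw one token index from the renormalized nucleus."""
    nucleus = top_p_filter(probs, top_p)
    return rng.choices(list(nucleus), weights=list(nucleus.values()), k=1)[0]

# Made-up distribution over five tokens.
probs = [0.5, 0.25, 0.125, 0.0625, 0.0625]
print(top_p_filter(probs, top_p=0.9))
print(sample_top_p(probs, top_p=0.9, rng=random.Random(0)))
```

With these numbers the first four tokens form the nucleus and the fifth is never sampled, however many draws are made.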
top_p = 0.9 → consider tokens until cumulative prob = 90%
cuts off the long tail of unlikely tokens
Often combined with temperature (both applied together, not one or the other)
Generative AI vs Traditional Software
| Aspect | Traditional Software | Generative AI |
|---|---|---|
| Output | Deterministic | Probabilistic |
| Debugging | Read the code | Prompt engineering + eval |
| Correctness | Binary (right/wrong) | Spectrum (quality varies) |
| Versioning | Code versions | Model versions + prompts |
| Testing | Unit/integration tests | Evals, human preference |
| Failure mode | Errors/exceptions | Hallucinations, drift |
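The "Testing" row can be illustrated with a small eval harness: it reports a pass rate over a set of cases instead of asserting one exact output. This is a minimal sketch. `fake_model` is a hypothetical stand-in for a real model call, and the cases and checks are invented for illustration.

```python
from typing import Callable

def run_eval(model: Callable[[str], str],
             cases: list[tuple[str, Callable[[str], bool]]]) -> float:
    """Return the fraction of cases whose model output passes its check."""
    passed = sum(1 for prompt, check in cases if check(model(prompt)))
    return passed / len(cases)

def fake_model(prompt: str) -> str:
    """Stand-in for a real model call (hypothetical; swap in an API client)."""
    canned = {
        "Capital of France?": "The capital of France is Paris.",
        "2 + 2 = ?": "5",
    }
    return canned.get(prompt, "")

# Each case pairs a prompt with a tolerant check rather than an exact string match.
cases = [
    ("Capital of France?", lambda out: "paris" in out.lower()),
    ("2 + 2 = ?", lambda out: "4" in out),
]
print(f"pass rate: {run_eval(fake_model, cases):.0%}")
```

Tracking this pass rate across model versions and prompt changes plays the role that a green or red test suite plays in traditional software.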
Generative AI Applications by Domain
Text Generation
├── Chatbots and assistants (customer support, coding)
├── Summarization, translation, classification
├── Content generation (marketing, documentation)
└── Code generation (GitHub Copilot, Cursor)
Image Generation
├── Text-to-image (Midjourney, DALL-E, Stable Diffusion)
├── Image editing, inpainting, style transfer
└── Product photography, design mockups
Audio / Video
├── Text-to-speech (ElevenLabs, OpenAI TTS)
├── Music generation (Suno, Udio)
└── Video generation (Sora, Runway)
Code
├── Autocomplete (Copilot, Cursor, Codeium)
├── Test generation, refactoring
└── SQL, regex, shell script generation