AI Fundamentals for Software Engineers
The AI Landscape — What You Actually Need to Know
AI is not one thing. As a software engineer, you need to understand where LLMs sit in the bigger picture and why certain architectural decisions exist.
Types of Artificial Intelligence
┌──────────────────────────────────────────────────────────┐
│ Types of AI by Scope │
│ │
│ Narrow AI (ANI) → does one thing very well │
│ e.g. GPT-4, DALL-E, AlphaGo, Spotify recommendations │
│ │
│ General AI (AGI) → human-level reasoning │
│ e.g. hypothetical — does NOT exist yet │
│ │
│ Super AI (ASI) → beyond human intelligence │
│ e.g. science fiction — does NOT exist yet │
└──────────────────────────────────────────────────────────┘
Everything you use today is Narrow AI. GPT-4, Claude, and Gemini are all narrow AI systems that are very good at language tasks.
Types of Machine Learning
Machine Learning
├── Supervised Learning
│ ├── Labeled data (input → expected output)
│ ├── Examples: spam detection, image classification, price prediction
│ └── Algorithms: linear regression, decision trees, neural nets
│
├── Unsupervised Learning
│ ├── No labels — find patterns in raw data
│ ├── Examples: customer clustering, anomaly detection
│ └── Algorithms: k-means, DBSCAN, PCA, autoencoders
│
├── Semi-supervised Learning
│ ├── Small labeled set + large unlabeled set
│ └── Examples: pseudo-labeling, label propagation (GPT pre-training is self-supervised, not semi-supervised)
│
├── Reinforcement Learning (RL)
│ ├── Agent takes actions in environment, gets reward/penalty
│ ├── Examples: game playing (AlphaGo), robot control, RLHF
│ └── Key: exploration vs exploitation tradeoff
│
└── Self-supervised Learning
  ├── Labels generated from the data itself
  └── Foundation of LLM training (next-token prediction)
Neural Networks & Deep Learning
The Neuron Analogy (simplified)
Input features × Weights + Bias → Activation Function → Output
x₁ ──w₁──┐
x₂ ──w₂──┤──[Σ + bias]──[ReLU/Sigmoid]──→ output
x₃ ──w₃──┘
Key Concepts
| Term | Meaning | Why it matters |
|---|---|---|
| Layer | Group of neurons processing together | Depth = more abstract features learned |
| Weights | Learnable parameters | What gets updated during training |
| Backpropagation | Computes the gradient of the loss for every weight by applying the chain rule backward through the layers | How neural nets learn |
| Gradient Descent | Minimize loss by stepping opposite to the gradient | Optimization mechanism |
| Overfitting | Model memorizes training data | Use dropout, regularization, more data |
| Epoch | One full pass over training data | More epochs ≠ always better |
| Batch Size | Samples processed before weight update | Affects speed vs stability |
| Learning Rate | How big each update step is | Too high = diverge, too low = slow |
| Loss Function | Measures prediction error | Cross-entropy (classification), MSE (regression) |
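To make the terms in the table concrete, here is a minimal sketch of a single neuron trained with full-batch gradient descent on an MSE loss. The toy dataset, the learning rate, and the helper names (`predict`, `mse`, `train`) are illustrative assumptions, not part of any library. It uses an identity activation instead of ReLU/Sigmoid to keep the gradient math short.

```python
# Toy single neuron: output = w * x + b (identity activation, an assumption
# made to keep the example short). The data follows y = 2x + 1, so training
# should move w toward 2 and b toward 1.

def predict(w: float, b: float, x: float) -> float:
    """Weighted input plus bias."""
    return w * x + b


def mse(w: float, b: float, data: list[tuple[float, float]]) -> float:
    """Loss function: mean squared error over the dataset."""
    return sum((predict(w, b, x) - y) ** 2 for x, y in data) / len(data)


def train(data: list[tuple[float, float]], learning_rate: float = 0.05,
          epochs: int = 2000) -> tuple[float, float]:
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(epochs):  # one epoch = one full pass over the data
        # Gradients of the MSE loss with respect to w and b.
        grad_w = sum(2 * (predict(w, b, x) - y) * x for x, y in data) / n
        grad_b = sum(2 * (predict(w, b, x) - y) for x, y in data) / n
        # Step opposite to the gradient to reduce the loss.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # made-up samples
w, b = train(data)
print(f"w={w:.3f}, b={b:.3f}, loss={mse(w, b, data):.6f}")
```

Raising `learning_rate` far enough makes the updates overshoot and diverge, and lowering it makes convergence need many more epochs, which is the trade-off the table describes.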
Transformers — The Architecture Behind LLMs
The 2017 paper "Attention Is All You Need" introduced the Transformer, which has since replaced RNNs as the dominant architecture for language tasks.
Input Text: "The cat sat on the"
↓
Token IDs: [464, 3797, 3332, 319, 262]
↓
Embeddings: each token → dense vector (e.g. 768 dimensions)
↓
┌─────────────────────────────────────┐
│ Transformer Block ×N │
│ │
│ Multi-Head Self-Attention │
│ ↓ (which tokens relate to which) │
│ Feed-Forward Network │
│ ↓ (learn complex transformations) │
│ Layer Norm + Residual Connection │
└─────────────────────────────────────┘
↓
Output: probability distribution over vocabulary
↓
Next token: "mat" (highest probability)
Self-Attention in Plain English
For each token, attention asks: "How much should I care about every other token in the sequence?"
"The bank can guarantee deposits will eventually cover future tuition costs"
bank → looks at: deposits (high), guarantee (high), tuition (medium)
(understands "bank" means financial institution, not river bank)
This is context-aware understanding, which is why LLMs can disambiguate meaning.
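The "how much should I care" question can be sketched as scaled dot-product attention for a single query token. This is a simplified illustration: the 2-dimensional embeddings are invented, and the query, key, and value vectors are all taken to be the raw embeddings, whereas a real transformer derives them through learned projection matrices and runs several heads in parallel.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores into weights that sum to 1 (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query: list[float], keys: list[list[float]]) -> list[float]:
    """Scaled dot-product score of one query against every key, then softmax."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    return softmax(scores)


def attend(query: list[float], keys: list[list[float]],
           values: list[list[float]]) -> list[float]:
    """Weighted average of the value vectors, using the attention weights."""
    weights = attention_weights(query, keys)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


# Hypothetical embeddings (assumption): axis 0 ~ "finance", axis 1 ~ "other".
tokens = ["the", "bank", "deposits", "tuition"]
vectors = [[0.1, 0.1], [1.0, 1.0], [2.0, 0.0], [1.0, 0.2]]

query = vectors[tokens.index("bank")]  # Q = K = V = embedding (simplification)
weights = attention_weights(query, vectors)
for token, weight in zip(tokens, weights):
    print(f"{token:>9}: {weight:.2f}")

# The context-aware representation of "bank" is a blend of all token vectors.
context_vector = attend(query, vectors, vectors)
```

With these made-up vectors, "deposits" receives more weight than "tuition", and "the" receives the least, mirroring the high/medium ranking above.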
LLMs (Large Language Models)
What They Are
LLMs are autoregressive transformers trained on massive text corpora to predict the next token. They learn statistical patterns so rich that emergent capabilities (reasoning, coding, translation) arise without being explicitly programmed.
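A toy illustration of "predict the next token" and of the autoregressive loop, using bigram counts instead of a transformer. The corpus and the function names are invented for the example. A real LLM conditions on the whole context and learns its probabilities with a neural network, but the training signal (each token's label is the token that follows it) and the generation loop have the same shape.

```python
from collections import Counter, defaultdict


def training_pairs(tokens: list[str]) -> list[tuple[str, str]]:
    """Self-supervised labels: each token's target is the token after it."""
    return list(zip(tokens, tokens[1:]))


def fit_bigram(tokens: list[str]) -> dict[str, Counter]:
    """Count how often each token follows each other token."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for current, nxt in training_pairs(tokens):
        counts[current][nxt] += 1
    return counts


def generate(counts: dict[str, Counter], start: str, max_tokens: int = 5) -> list[str]:
    """Autoregressive loop: predict a token, append it, feed it back in."""
    output = [start]
    for _ in range(max_tokens):
        followers = counts.get(output[-1])
        if not followers:
            break  # this token was never seen with a successor
        output.append(followers.most_common(1)[0][0])  # greedy pick
    return output


corpus = "the cat sat on the mat and the cat sat on the rug".split()
model = fit_bigram(corpus)
print(generate(model, "cat", max_tokens=3))  # ['cat', 'sat', 'on', 'the']
```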
Key Models
| Model | Company | Notable for |
|---|---|---|
| GPT-4o | OpenAI | General purpose, multimodal |
| Claude 3.5/4 | Anthropic | Long context, instruction following |
| Gemini 1.5/2 | Google | Multimodal, 1M token context |
| Llama 3 | Meta | Open weights, self-hostable |
| Qwen3 | Alibaba | Strong on Asian languages + code |
| Mistral | Mistral AI | Efficient, open weights |
Token Economics
"Hello, world!" → ["Hello", ",", " world", "!"] → 4 tokens
Rule of thumb: 1 token ≈ 0.75 words ≈ 4 characters (English)
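The rule of thumb can be turned into a rough estimator for budgeting. This is only a sketch: the function names are made up, the prices passed in are placeholders rather than any provider's real rates, and an exact count requires running the model's actual tokenizer.

```python
import math

CHARS_PER_TOKEN = 4      # rule of thumb for English text
WORDS_PER_TOKEN = 0.75   # rule of thumb for English text


def estimate_tokens(text: str) -> int:
    """Rough estimate: average of the character-based and word-based heuristics."""
    by_chars = len(text) / CHARS_PER_TOKEN
    by_words = len(text.split()) / WORDS_PER_TOKEN
    return math.ceil((by_chars + by_words) / 2)


def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_1m: float, output_price_per_1m: float) -> float:
    """Cost in dollars, given per-million-token prices supplied by the caller."""
    total = input_tokens * input_price_per_1m + output_tokens * output_price_per_1m
    return total / 1_000_000


prompt = "Summarize the following support ticket in two sentences."
tokens = estimate_tokens(prompt)
# Placeholder prices (assumption), not real rates for any model.
cost = estimate_cost(tokens, 100, input_price_per_1m=1.0, output_price_per_1m=3.0)
print(tokens, f"${cost:.6f}")
```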
Why it matters:
- Pricing is per token (input + output)
- Context window limits (GPT-4 Turbo/GPT-4o: 128k, Claude 3.5: 200k, Gemini 1.5: 1M+)
- Longer context = more expensive + slower
Training vs Fine-tuning vs Inference
TRAINING (pre-training)
─ Train from scratch on trillions of tokens
─ Cost: millions of dollars, months on 1000s of GPUs
─ Who does this: OpenAI, Google, Meta, Anthropic
─ Result: foundation model (base weights)
FINE-TUNING
─ Further train a pre-trained model on specific data
─ Cost: hundreds to thousands of dollars
─ Types:
• Full fine-tuning: update all weights
• LoRA/QLoRA: update small adapter matrices (PEFT)
• RLHF: use human feedback to align behavior
• DPO: direct preference optimization
INFERENCE
─ Run the trained model to generate output
─ Cost: per request (API) or GPU rental
─ Optimization: quantization (INT8/INT4), KV caching, batching
RLHF (Reinforcement Learning from Human Feedback)
How ChatGPT gets its helpful, harmless behavior:
1. Pre-train base model on internet data
2. Supervised fine-tuning: train on (prompt, ideal response) pairs
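3. Reward model: human raters rank multiple responses; a separate model is trained to predict those rankings
4. PPO: optimize the fine-tuned model from step 2 to maximize reward model score, with a KL penalty that keeps it close to its starting point
AI in Production — The Developer Perspective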
User Request
↓
Application Layer (Node.js/Python)
↓
Prompt Construction
↓
LLM API Call (OpenAI/Anthropic/Gemini)
↓
Response Parsing + Validation
↓
Post-Processing (tool calls? RAG? memory?)
↓
Response to User
Latency vs Quality Trade-offs
| Optimization | Trade-off |
|---|---|
| Smaller model (GPT-3.5 vs GPT-4) | Faster & cheaper, but less capable |
| Streaming responses | Better perceived latency (first token arrives sooner), same total latency |
| Caching identical prompts | Near-zero latency on repeat queries, staleness risk |
| Shorter prompts | Faster + cheaper, less context |
| Parallel LLM calls | Faster multi-step pipelines, higher cost |
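The request flow (prompt construction, LLM call, response validation) and the "caching identical prompts" row can be sketched together. Everything here is hypothetical: `call_llm_stub` stands in for a real provider SDK call, and `CachedLLM`, `build_prompt`, and `parse_response` are names invented for the example. The cache is an in-memory dict with no expiry, so the staleness risk from the table applies in full.

```python
import hashlib
import json
from typing import Callable


def build_prompt(question: str, context: str = "") -> str:
    """Prompt construction: instructions, optional context, user question."""
    parts = ['Answer in JSON as {"answer": "<text>"}.']
    if context:
        parts.append(f"Context:\n{context}")
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)


def call_llm_stub(prompt: str) -> str:
    """Stand-in for a real API call (assumption); returns JSON text."""
    return json.dumps({"answer": "stubbed reply"})


class CachedLLM:
    """Wraps an LLM callable and reuses responses for identical prompts."""

    def __init__(self, llm: Callable[[str], str]) -> None:
        self._llm = llm
        self._cache: dict[str, str] = {}
        self.calls = 0  # number of real (uncached) LLM calls

    def complete(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._llm(prompt)
        return self._cache[key]


def parse_response(raw: str) -> str:
    """Response parsing + validation: reject output without the expected shape."""
    data = json.loads(raw)
    if not isinstance(data, dict) or not isinstance(data.get("answer"), str):
        raise ValueError("response is missing a string field 'answer'")
    return data["answer"]


def handle_request(client: CachedLLM, question: str) -> str:
    prompt = build_prompt(question)
    raw = client.complete(prompt)
    return parse_response(raw)


client = CachedLLM(call_llm_stub)
first = handle_request(client, "What is a token?")
second = handle_request(client, "What is a token?")  # served from the cache
print(first == second, client.calls)  # True 1
```

Hashing the full prompt means any change to the instructions or context produces a new cache entry, which is usually the safer default.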
Evaluation Metrics
| Metric | Use Case |
|---|---|
| BLEU | Machine translation quality |
| ROUGE | Summarization quality |
| Perplexity | How well model predicts test data (lower = better) |
| Human eval | Gold standard for LLM outputs |
| LLM-as-judge | Use a strong LLM (e.g. GPT-4) to score other models' outputs |
| Faithfulness | RAG: answer supported by retrieved context? |
| Relevance | RAG: context retrieved actually relevant? |
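Two of the metrics above are simple enough to compute by hand. The sketch below shows perplexity from the probabilities a model assigned to the correct tokens, and a bare-bones ROUGE-1 recall based on whitespace-split unigrams. The function names and sample inputs are illustrative. Production evaluations normally rely on an established implementation with proper tokenization and stemming.

```python
import math
from collections import Counter


def perplexity(token_probs: list[float]) -> float:
    """exp of the average negative log-probability of the true tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


def rouge1_recall(reference: str, candidate: str) -> float:
    """Fraction of reference unigrams that also appear in the candidate."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count per word (clipped overlap).
    overlap = sum((ref_counts & cand_counts).values())
    return overlap / total


# A model that gives every correct token probability 0.25 is as uncertain
# as a uniform choice among 4 options, so its perplexity is about 4.
print(round(perplexity([0.25, 0.25, 0.25, 0.25]), 3))

# 5 of the 6 reference words are covered by the candidate.
print(round(rouge1_recall("the cat sat on the mat", "the cat lay on the mat"), 3))
```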
Key Terms Cheat Sheet
| Term | Definition |
|---|---|
| Hallucination | Model confidently states false information |
| Context window | Max tokens model can process at once |
| Temperature | Randomness of output (0 ≈ deterministic, higher = more random) |
| Top-p (nucleus sampling) | Sample from the smallest set of tokens whose cumulative probability reaches p |
| Top-k | Sample only from the k most likely tokens |
| Embeddings | Dense vector representation of text/data |
| Token | Smallest unit of text an LLM processes |
| Tokenizer | Splits text into tokens (BPE, WordPiece) |
| Quantization | Reduce model precision (FP32→INT8) to save memory |
| KV Cache | Cache attention keys/values to speed up inference |
| Grounding | Connecting model output to verified facts/sources |
| Guardrails | Rules/classifiers to prevent harmful outputs |
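Temperature, top-k, and top-p all reshape the next-token distribution before a token is drawn. The sketch below shows one plausible way to implement them over a small dictionary of logits. The logit values and function names are invented, and real inference stacks may apply these filters in a different order or with different tie-breaking.

```python
import math
import random


def softmax_with_temperature(logits: dict[str, float], temperature: float) -> dict[str, float]:
    """Lower temperature sharpens the distribution, higher flattens it."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use argmax for greedy decoding")
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    m = max(scaled.values())
    exps = {tok: math.exp(s - m) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def top_k_filter(probs: dict[str, float], k: int) -> dict[str, float]:
    """Keep the k most likely tokens and renormalize."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}


def top_p_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}


def sample(probs: dict[str, float], rng: random.Random) -> str:
    """Draw one token according to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


logits = {"mat": 3.0, "sofa": 2.0, "roof": 1.0, "moon": -1.0}  # made-up scores
probs = softmax_with_temperature(logits, temperature=0.7)
nucleus = top_p_filter(probs, p=0.9)  # drops the unlikely tail
rng = random.Random(0)  # seeded so runs are repeatable
print(sorted(nucleus), sample(nucleus, rng))
```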