
AI Fundamentals for Software Engineers

The AI Landscape — What You Actually Need to Know

AI is not one thing. As a software engineer, you need to understand where LLMs sit in the bigger picture and why certain architectural decisions exist.


Types of Artificial Intelligence

┌──────────────────────────────────────────────────────────┐
│                   Types of AI by Scope                   │
│                                                          │
│  Narrow AI (ANI)   →  does one thing very well          │
│  e.g. GPT-4, DALL-E, AlphaGo, Spotify recommendations   │
│                                                          │
│  General AI (AGI)  →  human-level reasoning             │
│  e.g. hypothetical — does NOT exist yet                  │
│                                                          │
│  Super AI (ASI)    →  beyond human intelligence         │
│  e.g. science fiction — does NOT exist yet               │
└──────────────────────────────────────────────────────────┘

Everything you use today is Narrow AI. GPT-4, Claude, Gemini — all narrow AI systems that are very good at language tasks.


Types of Machine Learning

Machine Learning
├── Supervised Learning
│   ├── Labeled data (input → expected output)
│   ├── Examples: spam detection, image classification, price prediction
│   └── Algorithms: linear regression, decision trees, neural nets
│
├── Unsupervised Learning
│   ├── No labels — find patterns in raw data
│   ├── Examples: customer clustering, anomaly detection
│   └── Algorithms: k-means, DBSCAN, PCA, autoencoders
│
├── Semi-supervised Learning
│   ├── Small labeled set + large unlabeled set
│   └── Examples: pseudo-labeling, label propagation (e.g. classifying documents when only a few are labeled)
│
├── Reinforcement Learning (RL)
│   ├── Agent takes actions in environment, gets reward/penalty
│   ├── Examples: game playing (AlphaGo), robot control, RLHF
│   └── Key: exploration vs exploitation tradeoff
│
└── Self-supervised Learning
    ├── Labels generated from the data itself
    └── Foundation of LLM training (next-token prediction)
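
To make "labels generated from the data itself" concrete, here is a minimal sketch of how next-token training pairs can be cut out of raw text. The function name, the whitespace "tokenizer", and the `context_size` parameter are illustrative assumptions; real LLM pipelines use subword tokenizers and far longer contexts.

```python
def next_token_pairs(tokens, context_size):
    """Build (context, target) pairs from an unlabeled token list.

    No human labels are needed: the target is simply the token that follows.
    """
    pairs = []
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - context_size):i]
        pairs.append((context, tokens[i]))
    return pairs


# Whitespace splitting stands in for a real tokenizer (assumption).
tokens = "the cat sat on the mat".split()
pairs = next_token_pairs(tokens, context_size=3)
for context, target in pairs:
    print(context, "->", target)
```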

Neural Networks & Deep Learning

The Neuron Analogy (simplified)

Input features × Weights + Bias → Activation Function → Output

x₁ ──w₁──┐
x₂ ──w₂──┤──[Σ + bias]──[ReLU/Sigmoid]──→ output
x₃ ──w₃──┘
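
The diagram above can be sketched in a few lines of Python. This is a toy single neuron, not how a framework implements it; the input values, weights, and bias are made up for illustration.

```python
def relu(z):
    """ReLU activation: pass positive values through, clamp negatives to 0."""
    return max(0.0, z)


def neuron(inputs, weights, bias, activation=relu):
    """Weighted sum of inputs plus bias, passed through an activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)


# Illustrative values only: 1.0*0.5 + 2.0*(-0.25) + 3.0*0.25 + 0.5 = 1.25
output = neuron([1.0, 2.0, 3.0], [0.5, -0.25, 0.25], bias=0.5)
print(output)
```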

Key Concepts

| Term | Meaning | Why it matters |
| --- | --- | --- |
| Layer | Group of neurons processing together | Depth = more abstract features learned |
| Weights | Learnable parameters | What gets updated during training |
| Backpropagation | Gradient flows backward to update weights | How neural nets learn |
| Gradient Descent | Minimize loss by stepping against the gradient (the direction of steepest descent) | Optimization mechanism |
| Overfitting | Model memorizes training data | Use dropout, regularization, more data |
| Epoch | One full pass over training data | More epochs ≠ always better |
| Batch Size | Samples processed before weight update | Affects speed vs stability |
| Learning Rate | How big each update step is | Too high = diverge, too low = slow |
| Loss Function | Measures prediction error | Cross-entropy (classification), MSE (regression) |
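
Several of these terms (weights, loss function, epoch, learning rate, gradient descent) show up together even in the smallest training loop. The sketch below fits a single weight `w` in `y = w * x` with MSE loss; the data and hyperparameters are invented for illustration, and each update moves `w` opposite to the gradient.

```python
def fit_slope(xs, ys, learning_rate=0.01, epochs=200):
    """Fit y = w * x by full-batch gradient descent on mean squared error."""
    w = 0.0
    n = len(xs)
    for _ in range(epochs):
        # d/dw of mean((w*x - y)^2) is mean(2 * (w*x - y) * x)
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad  # step against the gradient
    return w


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by the true slope w = 2
w = fit_slope(xs, ys)
print(w)
```

Rerunning with `learning_rate=0.2` on the same data makes the updates overshoot and diverge, which is the "too high = diverge" case from the table.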

Transformers — The Architecture Behind LLMs

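The 2017 paper "Attention Is All You Need" introduced the Transformer, which replaced RNNs as the dominant architecture for language tasks. Unlike an RNN, which reads tokens one at a time, a Transformer processes the whole sequence in parallel and lets every token attend directly to every other token.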

Input Text: "The cat sat on the"
     ↓
Token IDs: [464, 3797, 3332, 319, 262]
     ↓
Embeddings: each token → dense vector (e.g. 768 dimensions)
     ↓
┌─────────────────────────────────────┐
│        Transformer Block ×N         │
│                                     │
│  Multi-Head Self-Attention          │
│  ↓  (which tokens relate to which) │
│  Feed-Forward Network               │
│  ↓  (learn complex transformations) │
│  Layer Norm + Residual Connection   │
└─────────────────────────────────────┘
     ↓
Output: probability distribution over vocabulary
     ↓
Next token: "mat" (highest probability)
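
The last two steps of the diagram (scores to probabilities, then picking a token) can be sketched as a softmax followed by greedy selection. The four-word vocabulary and the logits are hypothetical; a real model scores tens of thousands of tokens and often samples instead of always taking the top one.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


vocab = ["mat", "sofa", "roof", "moon"]  # hypothetical tiny vocabulary
logits = [3.2, 2.1, 1.0, -0.5]           # hypothetical final-layer scores
probs = softmax(logits)

# Greedy decoding: pick the token with the highest probability.
next_token = vocab[probs.index(max(probs))]
print(next_token)
```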

Self-Attention in Plain English

For each token, attention asks: "How much should I care about every other token in the sequence?"

"The bank can guarantee deposits will eventually cover future tuition costs"

bank → looks at: deposits (high), guarantee (high), tuition (medium)
       (understands "bank" means financial institution, not river bank)

This is context-aware understanding — why LLMs can disambiguate meaning.
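
As a rough illustration of the mechanism, the sketch below implements single-head scaled dot-product attention in plain Python. It is a simplification under stated assumptions: the 2-dimensional embeddings are invented, and the same vectors are used as queries, keys, and values, whereas a real Transformer first applies learned projection matrices and uses many heads.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q.K / sqrt(d)) applied to V."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)  # how much this token attends to each token
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


tokens = ["bank", "deposits", "the"]
# Invented embeddings: "bank" and "deposits" point in similar directions.
embeddings = [[1.0, 0.2], [0.9, 0.1], [0.0, 0.1]]
outputs, weights = attention(embeddings, embeddings, embeddings)
for token, w in zip(tokens, weights[0]):
    print(f"bank -> {token}: {w:.2f}")
```

With these toy vectors, "bank" puts more weight on "deposits" than on "the", mirroring the high/medium pattern described above.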


LLMs (Large Language Models)

What They Are

LLMs are autoregressive transformers trained on massive text corpora to predict the next token. They learn statistical patterns so rich that emergent capabilities (reasoning, coding, translation) arise without being explicitly programmed.

Key Models

| Model | Company | Notable for |
| --- | --- | --- |
| GPT-4o | OpenAI | General purpose, multimodal |
| Claude 3.5/4 | Anthropic | Long context, instruction following |
| Gemini 1.5/2 | Google | Multimodal, 1M token context |
| Llama 3 | Meta | Open weights, self-hostable |
| Qwen3 | Alibaba | Strong on Asian languages + code |
| Mistral | Mistral AI | Efficient, open weights |

Token Economics

"Hello, world!" → ["Hello", ",", " world", "!"] → 4 tokens

Rule of thumb: 1 token ≈ 0.75 words ≈ 4 characters (English)

Why it matters:
- Pricing is per token (input + output)
- Context window limits (GPT-4 Turbo/GPT-4o: 128k, Claude 3.5: 200k, Gemini 1.5: 1M+)
- Longer context = more expensive + slower
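
The rule of thumb and per-token pricing combine into a quick back-of-the-envelope estimate. In the sketch below the function names are made up and the prices are placeholders, not any provider's actual rates; for exact counts you would run the provider's own tokenizer.

```python
def estimate_tokens(text):
    """Rough English estimate using the ~4 characters per token rule of thumb."""
    return max(1, round(len(text) / 4))


def estimate_cost(input_tokens, output_tokens,
                  input_price_per_million, output_price_per_million):
    """Cost in dollars when input and output tokens are priced separately."""
    return (input_tokens * input_price_per_million
            + output_tokens * output_price_per_million) / 1_000_000


prompt = "Summarize the following support ticket in two sentences. " * 20
input_tokens = estimate_tokens(prompt)

# Placeholder prices in dollars per million tokens (assumption, not real rates).
cost = estimate_cost(input_tokens, output_tokens=150,
                     input_price_per_million=2.0,
                     output_price_per_million=10.0)
print(input_tokens, f"${cost:.6f}")
```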

Training vs Fine-tuning vs Inference

TRAINING (pre-training)
  ─ Train from scratch on trillions of tokens
  ─ Cost: millions of dollars, months on 1000s of GPUs
  ─ Who does this: OpenAI, Google, Meta, Anthropic
  ─ Result: foundation model (base weights)

FINE-TUNING
  ─ Further train a pre-trained model on specific data
  ─ Cost: hundreds to thousands of dollars
  ─ Types:
    • Full fine-tuning: update all weights
    • LoRA/QLoRA: update small adapter matrices (PEFT)
    • RLHF: use human feedback to align behavior
    • DPO: direct preference optimization
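
To see why LoRA is so much cheaper than full fine-tuning, it helps to count trainable parameters. LoRA freezes the original `d_out × d_in` weight matrix and trains two small matrices of shapes `d_out × r` and `r × d_in` whose product is added to it. The layer size and rank below are illustrative assumptions.

```python
def lora_param_counts(d_in, d_out, rank):
    """Trainable parameters for one weight matrix: full fine-tuning vs LoRA."""
    full = d_in * d_out            # every entry of the matrix is updated
    lora = rank * (d_in + d_out)   # only the two low-rank adapter matrices
    return full, lora


# Illustrative layer: a 4096 x 4096 projection with adapter rank 8.
full, lora = lora_param_counts(d_in=4096, d_out=4096, rank=8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```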

INFERENCE
  ─ Run the trained model to generate output
  ─ Cost: per request (API) or GPU rental
  ─ Optimization: quantization (INT8/INT4), KV caching, batching
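
Quantization is easiest to understand on a handful of numbers. The sketch below uses symmetric "absmax" INT8 quantization, one simple scheme among several; production libraries typically quantize per channel or per block and handle outliers more carefully. The weight values are invented.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    largest = max(abs(v) for v in values)
    scale = largest / 127 if largest > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.33, 1.27]  # invented example weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q)
print([round(r, 4) for r in restored])
```

Each weight now needs 1 byte instead of 4 (for FP32), at the cost of a rounding error of at most half the scale per value.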

RLHF (Reinforcement Learning from Human Feedback)

How ChatGPT gets its helpful, harmless behavior:

1. Pre-train base model on internet data
2. Supervised fine-tuning: train on (prompt, ideal response) pairs
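3. Reward model: human raters rank multiple responses, and a separate model is trained to predict those rankings
4. PPO: optimize the fine-tuned model from step 2 to maximize the reward model score, with a penalty for drifting too far from its original behavior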
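
Step 3 is usually trained on pairs: for the same prompt, the response raters preferred should get a higher reward than the one they rejected. One common formulation is the pairwise loss `-log(sigmoid(r_chosen - r_rejected))`, sketched below with made-up reward values; treat it as an illustration of the idea rather than any specific lab's training code.

```python
import math


def pairwise_reward_loss(r_chosen, r_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), written as log(1 + e^(-diff))."""
    diff = r_chosen - r_rejected
    return math.log1p(math.exp(-diff))


# Made-up reward scores for a preferred and a rejected response.
good_ranking = pairwise_reward_loss(r_chosen=2.0, r_rejected=0.0)
bad_ranking = pairwise_reward_loss(r_chosen=0.0, r_rejected=2.0)
print(f"{good_ranking:.3f} {bad_ranking:.3f}")
```

The loss is small when the reward model already agrees with the human ranking and large when it disagrees, so minimizing it pushes the scores toward human preferences.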

AI in Production — The Developer Perspective

User Request
    ↓
Application Layer (Node.js/Python)
    ↓
Prompt Construction
    ↓
LLM API Call (OpenAI/Anthropic/Gemini)
    ↓
Response Parsing + Validation
    ↓
Post-Processing (tool calls? RAG? memory?)
    ↓
Response to User
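
The flow above maps naturally onto a few small functions. The sketch below is a skeleton only: `fake_llm_call` is a hypothetical stand-in for a real provider SDK call, and the JSON shape with an `answer` field is an assumption made for the example.

```python
import json


def build_prompt(user_request):
    """Prompt construction: wrap the user request with output instructions."""
    return ('Reply only with JSON shaped like {"answer": "..."}.\n'
            f"User request: {user_request}")


def fake_llm_call(prompt):
    """Hypothetical stand-in for an LLM API call; returns a canned reply."""
    return '{"answer": "42"}'


def parse_and_validate(raw):
    """Response parsing + validation: never trust model output blindly."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("model did not return valid JSON") from exc
    if not isinstance(data, dict) or not isinstance(data.get("answer"), str):
        raise ValueError("response is missing an 'answer' string")
    return data


def handle_request(user_request, llm_call=fake_llm_call):
    prompt = build_prompt(user_request)
    raw = llm_call(prompt)
    data = parse_and_validate(raw)
    return data["answer"].strip()  # post-processing would go here


print(handle_request("What is 6 * 7?"))
```

Passing the LLM call in as a parameter keeps the pipeline testable without network access and makes it easy to swap providers.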

Latency vs Quality Trade-offs

| Optimization | Trade-off |
| --- | --- |
| Smaller model (GPT-3.5 vs GPT-4) | Faster & cheaper, but less capable |
| Streaming responses | Better UX, same total latency |
| Caching identical prompts | Near-zero latency on repeat queries, staleness risk |
| Shorter prompts | Faster + cheaper, less context |
| Parallel LLM calls | Faster multi-step pipelines, higher cost |
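
Caching identical prompts is the simplest of these to sketch. The class below is a minimal in-memory example with a time-to-live to limit staleness; the class name, the TTL value, and the injected `llm_call` function are assumptions, and a production setup would more likely use a shared store such as Redis.

```python
import hashlib
import time


class PromptCache:
    """In-memory cache keyed by a hash of the exact prompt text."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}  # key -> (stored_at, response)

    def get_or_call(self, prompt, llm_call):
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # cache hit: no API call, no token cost
        response = llm_call(prompt)  # miss or expired: call the model
        self._store[key] = (now, response)
        return response


cache = PromptCache(ttl_seconds=300)
reply = cache.get_or_call("What is RAG?", lambda p: "stubbed model reply")
print(reply)
```

Only byte-identical prompts hit the cache, so any per-user or per-request text in the prompt defeats it.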

Evaluation Metrics

| Metric | Use Case |
| --- | --- |
| BLEU | Machine translation quality |
| ROUGE | Summarization quality |
| Perplexity | How well model predicts test data (lower = better) |
| Human eval | Gold standard for LLM outputs |
| LLM-as-judge | Use GPT-4 to evaluate other model outputs |
| Faithfulness | RAG: answer supported by retrieved context? |
| Relevance | RAG: context retrieved actually relevant? |
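
Perplexity is the one metric here that is a direct formula: the exponential of the average negative log-probability the model assigned to each actual token. The sketch below uses invented per-token probabilities; in practice they come from the model's output on held-out text.

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-probability of the observed tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# Invented probabilities a model assigned to the correct next tokens.
confident = perplexity([0.9, 0.8, 0.95])
unsure = perplexity([0.2, 0.1, 0.3])
print(f"{confident:.2f} {unsure:.2f}")
```

A model that assigns probability 0.25 to every correct token has perplexity 4: it is as uncertain as a uniform choice among four options.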

Key Terms Cheat Sheet

| Term | Definition |
| --- | --- |
| Hallucination | Model confidently states false information |
| Context window | Max tokens model can process at once |
| Temperature | Randomness of output (0 ≈ deterministic, higher = more random) |
| Top-p (nucleus sampling) | Sample from the smallest set of tokens whose cumulative probability reaches p |
| Top-k | Sample from the k most likely tokens |
| Embeddings | Dense vector representation of text/data |
| Token | Smallest unit of text an LLM processes |
| Tokenizer | Splits text into tokens (e.g. BPE, WordPiece) |
| Quantization | Reduce model precision (FP32→INT8) to save memory |
| KV Cache | Cache attention keys/values to speed up inference |
| Grounding | Connecting model output to verified facts/sources |
| Guardrails | Rules/classifiers to prevent harmful outputs |
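
Temperature, top-k, and top-p all act on the same next-token distribution, so they are easiest to compare in one function. This is a simplified sketch with invented logits. Real inference stacks differ in details such as the order of the filters, and many APIs treat temperature 0 as greedy decoding, whereas this version simply rejects it.

```python
import math
import random


def sampling_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """Return {token_id: probability} after temperature, top-k, and top-p."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0 in this sketch")
    scaled = [x / temperature for x in logits]  # low T sharpens, high T flattens
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]  # keep only the k most likely tokens
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in order:  # keep the smallest prefix whose mass reaches top_p
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        order = kept

    mass = sum(probs[i] for i in order)
    return {i: probs[i] / mass for i in order}  # renormalize the survivors


def sample(logits, rng=random, **kwargs):
    dist = sampling_distribution(logits, **kwargs)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]


logits = [2.0, 1.0, 0.0, -1.0]  # invented scores for a 4-token vocabulary
print(sampling_distribution(logits, temperature=0.7, top_k=3))
print(sample(logits, top_p=0.9))
```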