Local LLM Hardware Sizing and Model Selection
Choosing a local model is mostly a hardware-matching problem.
The practical question is not:
What is the biggest model I can technically load?
It is:
What model gives acceptable quality, latency, and memory usage on my machine?
The Three Main Constraints
- RAM or VRAM
- throughput (tokens/sec)
- context length
All three need to feel acceptable in practice; surplus on one axis does not make up for a shortfall on another.
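The memory constraint can be estimated before downloading anything. A minimal sketch, assuming weights dominate memory and using a rough bits-per-weight figure (exact values vary by quantization format; KV cache and runtime overhead are ignored here):

```python
def estimate_weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough memory for the model weights alone, ignoring KV cache and runtime overhead."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9  # decimal GB

# An 8B model at ~4.5 bits/weight (typical of 4-bit quantization formats)
print(round(estimate_weight_memory_gb(8, 4.5), 1))   # 4.5 (GB)
print(round(estimate_weight_memory_gb(70, 4.5), 1))  # 39.4 (GB)
```

This is why a 70B model is a different hardware class from an 8B one: even heavily quantized, the weights alone approach 40 GB before any context is loaded.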
Very Rough Model Sizing
Quantized local models are usually discussed in families like:
| Model size | Typical use |
|---|---|
| 1B–3B | very fast, lightweight assistants, edge devices |
| 7B–8B | strong sweet spot for local chat/code |
| 13B–14B | noticeably better, but more demanding |
| 32B+ | stronger quality, much heavier serving needs |
| 70B | premium local quality, serious hardware requirement |
For many developers, 7B–8B is the first truly practical tier.
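The table above can be turned into a quick fit check against your available RAM or VRAM. A sketch, assuming ~4-bit quantization; the 1.5x headroom factor for KV cache and runtime overhead is an assumption, not a measured rule:

```python
def fits_comfortably(params_billion: float, memory_gb: float,
                     bits_per_weight: float = 4.5, headroom: float = 1.5) -> bool:
    """Check whether quantized weights fit with room left for KV cache and overhead.

    The 1.5x headroom factor is an assumed rule of thumb, not a measured constant.
    """
    weights_gb = params_billion * bits_per_weight / 8  # billions of params -> GB
    return weights_gb * headroom <= memory_gb

# Which tiers fit a 16 GB machine at ~4-bit quantization?
print([size for size in (3, 8, 14, 32, 70) if fits_comfortably(size, 16)])  # [3, 8, 14]
```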
CPU vs GPU vs Apple Silicon
CPU
- easiest to access
- slower
- fine for smaller quantized models
GPU
- best for throughput
- especially important for larger models and serving multiple users
Apple Silicon
- very strong local inference story: unified memory lets the GPU address the same large RAM pool as the CPU, so sizeable quantized models fit without a discrete-VRAM ceiling
- often better than people expect for solo local use
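Whatever the hardware, the throughput constraint is easy to measure directly. A minimal timing harness; `generate` here is a placeholder for your real inference call (llama.cpp, Ollama, etc.), which is not modeled:

```python
import time

def tokens_per_second(generate, n_tokens: int) -> float:
    """Time any token-producing callable and return tokens/sec."""
    start = time.perf_counter()
    generate(n_tokens)
    return n_tokens / (time.perf_counter() - start)

# Demo with a fake generator that sleeps 20 ms per token (~50 tok/s ceiling);
# swap in your actual inference call to benchmark a real model.
fake_generate = lambda n: [time.sleep(0.02) for _ in range(n)]
print(round(tokens_per_second(fake_generate, 10)))  # roughly 50 on an idle machine
```

Run the same prompt at a few context lengths, too: generation speed often degrades as the context fills.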
Model Selection by Task
General assistant
Choose a solid instruct model in the 7B–8B range first.
Coding
Prefer coder-tuned models such as Qwen Coder or DeepSeek Coder families when available.
Reasoning-heavy tasks
Reasoning-oriented models can be excellent, but they are often slower and more verbose.
Embeddings
Do not use a chat model as an embedding model. Pick a dedicated embedding model.
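The reason is that embedding models emit fixed-length vectors designed to be compared by similarity, which chat models are not trained to produce. A sketch of the comparison step, using short hypothetical vectors (real embedding models emit hundreds of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional embeddings for a document and a query
doc_vec = [0.2, 0.8, 0.1]
query_vec = [0.25, 0.75, 0.05]
print(round(cosine_similarity(doc_vec, query_vec), 3))  # close to 1.0 = similar
```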
What Usually Goes Wrong
- choosing a model too large for comfortable throughput
- optimizing only for benchmark quality, ignoring latency
- forgetting context length memory costs
- using the wrong model family for the task
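The context-length mistake deserves numbers: the KV cache grows linearly with context and can rival the weights themselves. A sketch assuming an fp16 cache and a hypothetical 8B-class configuration (the layer, head, and dimension counts below are illustrative, not a specific model's):

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_value: int = 2) -> float:
    """KV cache size: 2 tensors (K and V) per layer, fp16 values by default."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value / 1e9

# Hypothetical 8B-class config: 32 layers, 8 KV heads (GQA), head_dim 128
print(round(kv_cache_gb(32, 8, 128, 8192), 2))    # ~1.07 GB at 8k context
print(round(kv_cache_gb(32, 8, 128, 131072), 2))  # ~17.18 GB at 128k context
```

A model that fits comfortably at 8k context can blow past your memory budget at 128k, even though the weights never changed.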
Good Practical Rule
Start with:
- one small fast model
- one main 7B–8B model
- one embedding model
Then upgrade only if your actual use case needs it.
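That starter kit maps naturally onto a trivial task router. A sketch; every model name and route below is a hypothetical placeholder, not a recommendation:

```python
# Hypothetical starter kit: names are placeholders for whatever you install
MODELS = {
    "fast": "small-3b-instruct-q4",
    "main": "general-8b-instruct-q4",
    "embed": "dedicated-embedding-model",
}

def pick_model(task: str) -> str:
    """Route a task label to a model name, defaulting to the main assistant."""
    routes = {"quick": "fast", "autocomplete": "fast", "search": "embed"}
    return MODELS[routes.get(task, "main")]

print(pick_model("autocomplete"))  # small-3b-instruct-q4
print(pick_model("chat"))          # general-8b-instruct-q4
```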
Interview Answer
How do you choose a local model?
Match the model to the hardware, task, and latency target. For most solo developers, a quantized 7B–8B instruct model is the best starting point, with dedicated coder or embedding models added for specialized workloads.