Local LLMs — Ollama, Quantization & Self-Hosted Inference

Running LLMs locally means no per-token API costs, no data leaving your machine, and no rate limits. That makes it essential for privacy-sensitive workloads and well suited to experimenting with open-source models.

Why Run Locally?

Factor	Cloud API	Local
Cost	Per token (can be expensive)	Hardware cost (one-time)
Privacy	Data sent to vendor	Stays on-device
Latency	Network round trip	No network hop; speed depends on local hardware (VRAM, memory bandwidth)
Rate limits	Yes	No
Model choice	Provider's models only	Any open-source model
Offline use	❌	✅

Ollama — The Easiest Local LLM Server

Ollama is a CLI and server that manages local model downloads and serves them over a REST API, including an OpenAI-compatible endpoint under /v1.

Setup

# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Start the server
ollama serve    # runs on http://localhost:11434

Pull and Run Models

# List models already downloaded to this machine
ollama list

# Pull models
ollama pull llama3.3          # Meta Llama 3.3 70B (4-bit quantized, ~40GB)
ollama pull llama3.2:3b       # 3B — runs on 4GB RAM
ollama pull mistral           # Mistral 7B
ollama pull qwen2.5:7b        # Qwen 2.5 7B, general-purpose (qwen2.5-coder:7b is the code-tuned variant)
ollama pull deepseek-r1:7b    # DeepSeek R1 reasoning model
ollama pull phi4              # Microsoft Phi-4 14B — strong reasoning, small size
ollama pull nomic-embed-text  # Embedding model for RAG
ollama pull mxbai-embed-large # Larger embedding model (1024-dim vectors)

# Interactive chat
ollama run llama3.2:3b

# One-shot
ollama run mistral "Explain RAG in 3 sentences"

Use via OpenAI SDK (Drop-in)

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",   # required by SDK but ignored by Ollama
)

# Chat
response = client.chat.completions.create(
    model="llama3.2:3b",
    messages=[{"role": "user", "content": "Explain HNSW index"}],
)
print(response.choices[0].message.content)

# Embeddings
emb = client.embeddings.create(
    model="nomic-embed-text",
    input="What is HNSW?",
)
print(emb.data[0].embedding[:5])
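
Embedding vectors are typically compared with cosine similarity. The sketch below assumes the vectors are plain lists of floats, like `emb.data[0].embedding` above. The helper name `cosine_similarity` is illustrative and not part of the OpenAI SDK or Ollama, and the toy vectors stand in for real embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors used in place of real embeddings.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # close to 1.0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])                # 0.0
```

With real embeddings, one would call `client.embeddings.create(...)` once per text and pass the two `embedding` lists to this helper.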

Streaming with Ollama

stream = client.chat.completions.create(
    model="qwen2.5:7b",
    messages=[{"role": "user", "content": "Write a FastAPI SSE endpoint"}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
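
When the full reply is needed after streaming (for logging or caching, say), the deltas can be accumulated as they arrive. This is a minimal sketch: `collect_stream` and `_fake_chunk` are hypothetical helpers, and the fake chunks only mimic the attribute shape of SDK streaming chunks so the example runs without a server.

```python
from types import SimpleNamespace


def collect_stream(stream) -> str:
    """Join the text deltas of a chat-completion stream into one string."""
    parts = []
    for chunk in stream:
        if not chunk.choices:      # skip chunks that carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:                  # content can be None on some chunks
            parts.append(delta)
    return "".join(parts)


def _fake_chunk(text):
    """Hypothetical stand-in shaped like a streaming chunk, for offline use."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [
    _fake_chunk("Hello"),
    _fake_chunk(None),
    SimpleNamespace(choices=[]),
    _fake_chunk(", world"),
]
full_text = collect_stream(fake_stream)
```

Against a live server, the `stream` object returned by `client.chat.completions.create(..., stream=True)` would be passed in instead of `fake_stream`.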

Ollama in Docker (for servers)

# docker-compose.yml
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    # For GPU (requires the NVIDIA Container Toolkit on the host):
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

volumes:
  ollama_data:

docker compose up -d
# Address the service by name; the generated container name depends on the project directory
docker compose exec ollama ollama pull llama3.2:3b
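
Once the container is up, it can be useful to confirm from code which models the server holds. The sketch below assumes Ollama's `GET /api/tags` endpoint, which returns a JSON object with a `models` list. The helper names are illustrative, and the sample payload is hand-written with made-up field values.

```python
import json
import urllib.request


def model_names(tags_payload: dict) -> list[str]:
    """Extract model names from the JSON body of GET /api/tags."""
    return [m["name"] for m in tags_payload.get("models", [])]


def list_local_models(base_url: str = "http://localhost:11434") -> list[str]:
    """Ask a running Ollama server which models it has downloaded."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return model_names(json.load(resp))


# Offline example: a hand-written payload, not real server output.
sample = {"models": [{"name": "llama3.2:3b", "size": 2019393189}]}
names = model_names(sample)
```

Calling `list_local_models()` requires the server from the compose file to be reachable on port 11434.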

LangChain + Ollama for Local RAG

from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Local LLM
llm = ChatOllama(model="llama3.2:3b", temperature=0)

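# Local embeddings
# Assumes ./local_db already holds documents (e.g. added earlier with vectorstore.add_texts([...]))
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = Chroma(embedding_function=embeddings, persist_directory="./local_db")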

# Local RAG chain — $0 per query
prompt = ChatPromptTemplate.from_template(
    "Answer using only this context:\n\n{context}\n\nQuestion: {question}"
)

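def format_docs(docs):
    # Join retrieved Documents into plain text instead of passing their repr to the prompt
    return "\n\n".join(doc.page_content for doc in docs)

chain = (
    {"context": vectorstore.as_retriever() | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)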

print(chain.invoke("What is HNSW?"))

Quantization — Run Bigger Models on Less Hardware

Quantization stores model weights at lower precision, typically 4-bit integers instead of 16-bit floats. A 7B model shrinks from ~14GB (fp16) to ~4GB (4-bit) with only a small quality loss.
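
To make the idea concrete, here is a minimal sketch of symmetric "absmax" quantization on a handful of weights. It is a deliberately simplified scheme chosen for illustration. Real formats such as GGUF K-quants, NF4, and AWQ quantize in small blocks with per-block scales and use more elaborate methods. The function names are hypothetical.

```python
def quantize_absmax(weights: list[float], bits: int = 4):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1                    # 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0.0:                              # all-zero input
        scale = 1.0
    q = [round(w / scale) for w in weights]
    return q, scale


def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the integers and the scale."""
    return [v * scale for v in q]


weights = [0.70, -0.32, 0.11, 0.02]
q, scale = quantize_absmax(weights)               # q == [7, -3, 1, 0]
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is now one small integer plus a share of the scale, and the round-trip error is bounded by half the scale. That bound is why outlier weights, which inflate the scale, hurt quality.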

Quantization Formats

Format	Creator	Quality	Hardware	Use case
GGUF	llama.cpp	Good	CPU+GPU	Ollama, LM Studio
GPTQ	AutoGPTQ	Excellent	GPU only	Production GPU
AWQ	MIT HAN Lab	Best quality	GPU only	Production GPU
bitsandbytes	HuggingFace	Good	GPU only	HF transformers
GGML	(legacy)	—	—	Superseded by GGUF

Precision Comparison

Precision	Size (7B model)	Quality loss
fp32	~28GB	Baseline
fp16	~14GB	Negligible
int8	~7GB	Very small
int4 (Q4_K_M)	~4.1GB	Small — recommended
int3	~3GB	Moderate
int2	~2GB	Significant
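
The sizes in this table follow from simple arithmetic: parameters × bits per weight ÷ 8 gives bytes. A small sketch, assuming exactly 7e9 parameters and 1 GB = 1e9 bytes (the function name is illustrative):

```python
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 7e9
sizes = {
    name: weight_size_gb(n_params, bits)
    for name, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("int4", 4)]
}
# sizes == {"fp32": 28.0, "fp16": 14.0, "int8": 7.0, "int4": 3.5}
```

A Q4_K_M file comes out larger than the 3.5GB computed here because the format stores per-block scales and keeps some tensors at higher precision, so its effective bits per weight is above 4.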

HuggingFace bitsandbytes (4-bit)

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NF4 quantization (best for LLMs)
    bnb_4bit_use_double_quant=True,       # nested quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

inputs = tokenizer("Explain RAG:", return_tensors="pt").to(model.device)  # follow device_map placement
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

AWQ (better quality than bitsandbytes)

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Qwen/Qwen2.5-7B-Instruct-AWQ"   # pre-quantized AWQ model from HF Hub

model = AutoAWQForCausalLM.from_quantized(model_path, fuse_layers=True, device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained(model_path)

Hardware Requirements

CPU / Apple Silicon (Ollama GGUF Q4)

Model	RAM needed	Speed (approx)
1–3B	4GB	~20 tok/s on M1
7–8B	8GB	~10 tok/s on M1
13B	16GB	~5 tok/s on M1
70B	48GB	~1–2 tok/s on M1 Ultra

GPU (bitsandbytes/AWQ int4)

Model	VRAM needed
7–8B	6–8GB (e.g. RTX 3060 12GB, RTX 4070)
13B	10–12GB (e.g. RTX 3080 Ti)
70B	40–48GB (e.g. 2×RTX 3090/4090 or one A100 80GB)
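
The VRAM figures are higher than the raw int4 weight size partly because the KV cache also lives in GPU memory and grows with context length. The sketch below estimates it for one sequence. The layer and head counts are assumptions matching the published Llama 3.1 8B configuration (32 layers, 8 KV heads, head dimension 128), with fp16 cache values.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_value: int = 2) -> int:
    """Bytes needed for the keys and values of one sequence."""
    # Leading factor 2: one tensor for keys, one for values, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


# Assumed model shape (Llama 3.1 8B style) at an 8192-token context.
cache = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_len=8192)
cache_gib = cache / 2**30   # 1.0 GiB
```

Serving several requests at once multiplies this figure by the number of concurrent sequences, so long contexts and batching both eat into the headroom left after loading the weights.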

Apple Silicon (Unified Memory): with MLX or Ollama, the GPU can address most of the unified memory pool, so a MacBook Pro M3 Max with 128GB can run 70B models.

MLX — Apple Silicon Optimized

MLX is Apple's ML framework for M-series chips, optimized for unified memory.

pip install mlx-lm

from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")
response = generate(model, tokenizer, prompt="Explain HNSW", max_tokens=200)
print(response)

vLLM — Production-Grade Self-Hosted Server

vLLM is a widely used engine for high-throughput LLM serving, built around PagedAttention.

pip install vllm

# Serve a model
vllm serve meta-llama/Llama-3.2-3B-Instruct \
    --dtype auto \
    --max-model-len 8192 \
    --port 8000

# Drop-in OpenAI-compatible client
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="vllm")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

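vLLM advantages: continuous batching (new requests join the running batch between decoding steps instead of waiting for the current batch to finish), PagedAttention (the KV cache is stored in fixed-size blocks, which reduces memory fragmentation), and tensor parallelism across multiple GPUs.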
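
The benefit of continuous batching can be seen in a toy model that only counts decoding steps. This is a sketch under strong simplifications (one token per request per step, a fixed number of batch slots, no prefill cost) and is not vLLM's actual scheduler. Both function names are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Each batch runs until its longest request has finished."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """A slot freed by a finished request is refilled before the next step."""
    waiting = deque(lengths)
    running: list[int] = []          # tokens still to generate per active request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1                   # one decoding step for every active request
        running = [r - 1 for r in running if r - 1 > 0]
    return steps


# One long request followed by three short ones, two batch slots.
lengths = [8, 2, 2, 2]
static_steps = static_batching_steps(lengths, batch_size=2)          # 10
continuous_steps = continuous_batching_steps(lengths, batch_size=2)  # 8
```

In the static case the short request sharing a batch with the long one leaves its slot idle for six steps. The continuous scheduler hands that slot to the waiting requests, which is where the throughput gain comes from.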

Choosing a Local Model (2025)

Task	Recommended model	Size
General Q&A	Llama 3.3 70B (Q4)	~40GB
Code generation	Qwen2.5-Coder 7B	~4GB
Reasoning	DeepSeek-R1:7B	~4.7GB
Small device / fast	Phi-4-mini	~2.5GB
Embeddings (RAG)	nomic-embed-text	~275MB
Multilingual	Qwen2.5:7B	~4GB
