8 min read
Advanced RAG Patterns
Why Basic RAG Fails
Basic RAG problems:
1. Retrieval misses — user asks "What's the refund policy?"
but doc says "Return & Exchange Guidelines" (semantic mismatch)
2. Lost in the middle — model ignores chunks in the middle of context
3. Conflicting chunks — retrieved docs contradict each other
4. No temporal awareness — old docs retrieved for current questions
5. Cross-document reasoning — answer requires combining 3+ docs
6. Short queries — "Why?" doesn't provide enough signal to retrieve wellPattern 1: HyDE (Hypothetical Document Embeddings)
Instead of embedding the query, generate a hypothetical answer and embed that.
pythonfrom langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
def hyde_retrieve(query: str, vectorstore, k: int = 5) -> list:
# Step 1: Generate a hypothetical answer
hyde_prompt = ChatPromptTemplate.from_template("""
Write a paragraph that would directly answer this question.
Be specific and detailed. This is a hypothetical answer for search purposes.
Question: {question}
Hypothetical Answer:""")
hypothetical_answer = llm.invoke(hyde_prompt.format(question=query)).content
# Step 2: Embed the hypothetical answer (not the query!)
hyp_embedding = embeddings.embed_query(hypothetical_answer)
# Step 3: Search using the hypothetical embedding
docs = vectorstore.similarity_search_by_vector(hyp_embedding, k=k)
return docs
# Why it works:
# Short query "refund policy" → embedding is sparse
# Hypothetical answer "Customers may return items within 30 days..."
# → embedding is dense and close to the actual policy documentPattern 2: Reranking
Retrieve more candidates, then rerank them with a cross-encoder for precision.
pythonfrom sentence_transformers import CrossEncoder
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
# Two-stage retrieval:
# Stage 1: Fast approximate search (top-20 candidates)
# Stage 2: Slow but accurate reranking (pick top-5)
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
def retrieve_and_rerank(query: str, vectorstore, initial_k=20, final_k=5) -> list:
# Stage 1: Retrieve 20 candidates (recall-optimized)
candidates = vectorstore.similarity_search(query, k=initial_k)
# Stage 2: Score each candidate against the query
pairs = [(query, doc.page_content) for doc in candidates]
scores = reranker.predict(pairs)
# Stage 3: Sort by score, return top-5 (precision-optimized)
ranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
return [doc for _, doc in ranked[:final_k]]
# Cohere reranker (API-based, higher quality):
import cohere
co = cohere.Client(api_key="...")
def cohere_rerank(query: str, docs: list, top_n: int = 5) -> list:
texts = [doc.page_content for doc in docs]
results = co.rerank(
model="rerank-english-v3.0",
query=query,
documents=texts,
top_n=top_n,
)
return [docs[r.index] for r in results.results]Pattern 3: Hybrid Search (Dense + Sparse)
Combine semantic search (vectors) with keyword search (BM25) for best of both worlds.
pythonfrom langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
# Dense retriever (semantic — finds conceptually similar docs)
vectorstore = Chroma(embedding_function=OpenAIEmbeddings())
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})
# Sparse retriever (keyword — finds exact term matches)
bm25_retriever = BM25Retriever.from_documents(all_documents, k=10)
# Hybrid: combine both with Reciprocal Rank Fusion
hybrid_retriever = EnsembleRetriever(
retrievers=[bm25_retriever, dense_retriever],
weights=[0.4, 0.6], # Slightly favor semantic
)
results = hybrid_retriever.invoke("JWT token expiration")
# Dense: finds docs about "authentication" and "session management"
# Sparse: finds docs containing exact string "JWT" and "expiration"
# Hybrid: best of both
# pgvector hybrid search (production-grade):
# SELECT id, content, embedding <=> $1 AS semantic_distance
# FROM documents
# WHERE to_tsvector('english', content) @@ plainto_tsquery($2) -- BM25 filter
# ORDER BY semantic_distance
# LIMIT 10;Pattern 4: Self-Query / Metadata Filtering
Let the LLM generate structured filters from natural language queries.
pythonfrom langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo
from langchain_openai import ChatOpenAI
from langchain_chroma import Chroma
# Define metadata schema
metadata_field_info = [
AttributeInfo(name="source", description="Document source file", type="string"),
AttributeInfo(name="date", description="Document date (YYYY-MM-DD)", type="string"),
AttributeInfo(name="category", description="Category: policy, technical, hr, legal", type="string"),
AttributeInfo(name="department", description="Owning department", type="string"),
]
retriever = SelfQueryRetriever.from_llm(
llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),
vectorstore=vectorstore,
document_contents="Internal company documents",
metadata_field_info=metadata_field_info,
)
# User asks: "What are the HR policies from 2024?"
# LLM generates filter: {"category": "hr", "date": {"$gte": "2024-01-01"}}
# Retriever applies filter before semantic search
results = retriever.invoke("What are the HR policies from 2024?")
# Returns only HR docs from 2024, not all docs about policiesPattern 5: Contextual Compression
Compress retrieved chunks to only include the relevant parts.
pythonfrom langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI
# Problem: retrieved chunk is 500 tokens but only 50 tokens are relevant
# Solution: LLM extracts only the relevant portion
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
base_compressor=compressor,
base_retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
)
# Before compression: returns 5 full chunks (~2500 tokens total)
# After compression: returns only relevant sentences from each chunk (~300 tokens)
docs = compression_retriever.invoke("What is the cancellation policy?")Pattern 6: Multi-Query Retrieval
Generate multiple reformulations of the query to catch more relevant docs.
pythonfrom langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI
retriever = MultiQueryRetriever.from_llm(
retriever=vectorstore.as_retriever(),
llm=ChatOpenAI(model="gpt-4o-mini", temperature=0.3),
)
# User: "How do I cancel my subscription?"
# Generates:
# 1. "subscription cancellation process"
# 2. "how to end my plan"
# 3. "unsubscribe account termination"
# Retrieves for ALL three, deduplicates results
# Finds more relevant docs than any single query
results = retriever.invoke("How do I cancel my subscription?")Pattern 7: RAPTOR (Recursive Abstractive Processing)
Build a hierarchy of summaries for multi-level retrieval.
python# RAPTOR = bottom-up: summarize clusters of docs, then summarize those summaries
# Creates a "tree" of information at different abstraction levels
# Level 0: Raw documents (specific details)
# Level 1: Cluster summaries (topic-level)
# Level 2: Section summaries (high-level)
# Level 3: Document-level summary (overview)
# At query time: retrieve from ALL levels
# "Give me a high-level overview" → hits Level 2-3
# "What's the exact threshold for X?" → hits Level 0
# Implementation sketch:
from sklearn.cluster import KMeans
import numpy as np
def build_raptor_tree(documents: list, levels: int = 3):
current_docs = documents
for level in range(levels):
# 1. Embed current docs
embeddings = embed_documents(current_docs)
# 2. Cluster into N groups
n_clusters = max(2, len(current_docs) // 5)
kmeans = KMeans(n_clusters=n_clusters)
labels = kmeans.fit_predict(embeddings)
# 3. Summarize each cluster
summaries = []
for cluster_id in range(n_clusters):
cluster_docs = [d for d, l in zip(current_docs, labels) if l == cluster_id]
combined = "\n\n".join([d.page_content for d in cluster_docs])
summary = llm.invoke(f"Summarize these documents: {combined[:4000]}")
summaries.append(Document(page_content=summary.content,
metadata={"level": level + 1}))
# 4. Add summaries to vector store
vectorstore.add_documents(summaries)
current_docs = summaries # Next level summarizes the summariesPattern 8: Corrective RAG (CRAG)
Evaluate retrieved docs and fall back to web search if they're not good enough.
pythonfrom langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_community.tools.tavily_search import TavilySearchResults
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
def grade_document(query: str, document: str) -> str:
"""Grade if document is relevant to the query. Returns 'relevant' or 'irrelevant'."""
grading_prompt = ChatPromptTemplate.from_template("""
Is this document relevant to answering the question?
Answer only 'relevant' or 'irrelevant'.
Question: {question}
Document: {document}
""")
result = llm.invoke(grading_prompt.format(question=query, document=document))
return result.content.strip().lower()
def corrective_rag(query: str, vectorstore) -> str:
# Step 1: Retrieve docs
docs = vectorstore.similarity_search(query, k=4)
# Step 2: Grade each doc
relevant_docs = []
has_irrelevant = False
for doc in docs:
grade = grade_document(query, doc.page_content)
if grade == "relevant":
relevant_docs.append(doc)
else:
has_irrelevant = True
# Step 3: If docs are poor quality, supplement with web search
if not relevant_docs or has_irrelevant:
web_search = TavilySearchResults(max_results=3)
web_results = web_search.invoke(query)
web_docs = [Document(page_content=r["content"]) for r in web_results]
relevant_docs.extend(web_docs)
# Step 4: Generate answer from best available context
context = "\n\n".join([d.page_content for d in relevant_docs])
return llm.invoke(f"Answer based on context:\n{context}\n\nQuestion: {query}").contentAdvanced Chunking Strategies
python# 1. Semantic chunking (splits at topic boundaries, not fixed size)
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
splitter = SemanticChunker(
OpenAIEmbeddings(),
breakpoint_threshold_type="percentile",
breakpoint_threshold_amount=95, # Split when similarity drops below 95th percentile
)
docs = splitter.split_text(long_document)
# 2. Document-aware chunking (respects headers, sections)
from langchain.text_splitter import MarkdownHeaderTextSplitter
headers_to_split_on = [
("#", "h1"),
("##", "h2"),
("###", "h3"),
]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
docs = splitter.split_text(markdown_content)
# Each chunk has metadata: {"h1": "Chapter 3", "h2": "Section 2.1"}
# 3. Sentence window retrieval (embed sentences, retrieve with surrounding context)
# Index: embed individual sentences
# Retrieve: return sentence + N sentences before/after for context
def sentence_window_retrieval(query, vectorstore, window=2):
# Find the most relevant sentence
relevant = vectorstore.similarity_search(query, k=3)
expanded = []
for doc in relevant:
# Expand with surrounding context from original document
start = max(0, doc.metadata["sentence_idx"] - window)
end = doc.metadata["sentence_idx"] + window + 1
window_text = " ".join(all_sentences[start:end])
expanded.append(Document(page_content=window_text))
return expanded
# 4. Late chunking (embed whole doc first, then chunk the embeddings)
# Preserves full document context in each chunk's embedding
# Better than chunking then embedding (each chunk is context-aware)
# Requires a model that supports it (Jina AI, some Cohere models)RAG Evaluation Deep Dive (RAGAS)
pythonfrom ragas import evaluate
from ragas.metrics import (
faithfulness, # Is answer grounded in retrieved context?
answer_relevancy, # Does answer address the question?
context_precision, # Are retrieved chunks actually relevant?
context_recall, # Did we retrieve all needed info? (needs ground truth)
context_entity_recall, # Did we retrieve the right entities?
answer_correctness, # Is the answer factually correct? (needs ground truth)
)
from datasets import Dataset
# Build evaluation dataset
eval_data = {
"question": ["What is the refund policy?", "How do I reset my password?"],
"answer": ["You can refund within 30 days.", "Click Forgot Password on login."],
"contexts": [
["Refund Policy: All items can be returned within 30 days..."],
["To reset: go to login page, click Forgot Password, enter email..."],
],
"ground_truth": [ # Optional but enables more metrics
"Items can be returned within 30 days for a full refund.",
"Use the Forgot Password link on the login page.",
],
}
dataset = Dataset.from_dict(eval_data)
results = evaluate(
dataset=dataset,
metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(results)
# {
# "faithfulness": 0.92, # 92% of claims are in the context
# "answer_relevancy": 0.88, # 88% of answer addresses the question
# "context_precision": 0.85, # 85% of retrieved chunks are relevant
# "context_recall": 0.79, # Missed 21% of needed information
# }
# Low faithfulness → LLM is hallucinating despite good context
# Low context_precision → retrieval returning too much noise
# Low context_recall → missing relevant chunks (chunking/embedding issue)
# Low answer_relevancy → answer is off-topic despite good retrievalProduction RAG Architecture
USER QUERY
│
┌─────────▼──────────┐
│ Query Analysis │ ← Classify: simple / complex / out-of-scope
└─────────┬──────────┘
│
┌─────────▼──────────┐
│ Query Expansion │ ← Multi-query, HyDE, spelling correction
└─────────┬──────────┘
│
┌────────────▼────────────┐
│ Hybrid Retrieval │ ← Dense + Sparse, metadata filter
│ (initial k=20 docs) │
└────────────┬────────────┘
│
┌─────────▼──────────┐
│ Reranking │ ← Cross-encoder, pick top-5
└─────────┬──────────┘
│
┌─────────▼──────────┐
│ Context Assembly │ ← Dedup, sort by relevance, format
└─────────┬──────────┘
│
┌─────────▼──────────┐
│ LLM Generation │ ← With citations, streaming
└─────────┬──────────┘
│
┌─────────▼──────────┐
│ Response Check │ ← Faithfulness check (optional)
└─────────┬──────────┘
│
RESPONSE + CITATIONSTricky RAG Interview Scenarios
"Your RAG retrieves correctly but the LLM still gives wrong answers"
Causes:
1. Context too long → lost in the middle (important info in middle is ignored)
Fix: put most relevant chunk FIRST and LAST in context
2. Conflicting chunks → LLM hedges or picks wrong one
Fix: deduplicate, or ask LLM to resolve conflicts explicitly
3. System prompt overriding context → trust issues
Fix: explicit "Answer ONLY from the provided context below"
4. LLM training data conflicts with retrieved content
Fix: stronger instruction: "Your training knowledge is outdated, use ONLY this context""RAG works in testing but fails in production with real user queries"
Distribution shift: test queries ≠ real queries
- Test with a representative sample of real queries
- Use LangSmith to capture and analyze production traces
- A/B test retrieval strategies in production
- Monitor context_precision metric over time[prev·next]