Retrieval Augmented Generation

Retrieval Augmented Generation (RAG) enhances large language models by retrieving relevant external documents at query time, grounding responses in factual, up-to-date information without retraining the model. RAG addresses core LLM limitations — hallucinations, outdated knowledge, and lack of domain-specific data — making it one of the most widely deployed patterns for production AI agent systems.

Core RAG Pipeline

RAG operates in three stages:

  1. Retrieval — A query is embedded into a vector $\mathbf{q} = E(\text{query})$ and used to search a knowledge base (vector database, keyword index, or hybrid) for the top-$k$ relevant document chunks by similarity $\text{sim}(\mathbf{q}, \mathbf{d}_i)$
  2. Augmentation — Retrieved chunks are injected into the LLM prompt alongside the user query to provide grounding context
  3. Generation — The LLM synthesizes a response $P(\text{answer} \mid \text{query}, d_1, d_2, \ldots, d_k)$ using both its training knowledge and the retrieved context
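The three stages can be sketched end to end with a toy bag-of-words embedding standing in for a real embedding model $E$ (the `embed` function, vocabulary, and sample corpus below are all illustrative, not production components):

```python
import math
import re

def embed(text, vocab):
    # Toy bag-of-words embedding: a stand-in for a real model E(text)
    tokens = re.findall(r"\w+", text.lower())
    return [tokens.count(w) for w in vocab]

def cosine(q, d):
    # sim(q, d) = (q . d) / (||q|| * ||d||)
    dot = sum(a * b for a, b in zip(q, d))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in d))
    return dot / norm if norm else 0.0

corpus = [
    "RAG retrieves documents to ground LLM answers.",
    "Transformers use self-attention over token sequences.",
    "Vector databases index embeddings for similarity search.",
]
vocab = sorted({w for doc in corpus for w in re.findall(r"\w+", doc.lower())})

def retrieve(query, k=2):
    # Stage 1: embed the query, rank chunks by similarity, keep top-k
    q = embed(query, vocab)
    return sorted(corpus, key=lambda d: cosine(q, embed(d, vocab)), reverse=True)[:k]

def augment(query, chunks):
    # Stage 2: inject retrieved chunks into the prompt as grounding context
    return "Context:\n" + "\n".join(chunks) + f"\n\nQuestion: {query}"

prompt = augment("How does RAG ground answers?", retrieve("How does RAG ground answers?"))
# Stage 3 (generation) would pass `prompt` to an LLM.
```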

RAG Variants

Naive RAG

The simplest implementation: embed query, retrieve top-$k$ chunks by cosine similarity $\frac{\mathbf{q} \cdot \mathbf{d}}{||\mathbf{q}|| \cdot ||\mathbf{d}||}$, stuff into prompt, generate. Prone to retrieval noise, irrelevant chunks, and context overflow on complex queries.

Advanced RAG

Optimizes each stage of the pipeline: query rewriting and expansion before retrieval, hybrid (dense plus keyword) search during retrieval, and reranking and context compression after retrieval. The code example later in this article demonstrates the retrieve-then-rerank pattern.

Modular RAG

Breaks the pipeline into interchangeable components — retrieval, reranking, memory, and routing modules — that can be swapped or recombined for maximum flexibility.
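As a sketch of what "interchangeable" means in practice, the retrieval stage can sit behind a small interface so that dense, keyword, or graph retrievers slot in without touching the rest of the pipeline (the `Retriever` protocol and `KeywordRetriever` class below are hypothetical illustrations, not a library API):

```python
from typing import List, Protocol

class Retriever(Protocol):
    # Any module implementing this signature can be plugged in
    def retrieve(self, query: str, k: int) -> List[str]: ...

class KeywordRetriever:
    # Simple term-overlap retriever; a vector or graph retriever
    # with the same signature could replace it transparently
    def __init__(self, docs: List[str]):
        self.docs = docs

    def retrieve(self, query: str, k: int) -> List[str]:
        terms = set(query.lower().split())
        scored = sorted(self.docs,
                        key=lambda d: len(terms & set(d.lower().split())),
                        reverse=True)
        return scored[:k]

def build_prompt(query: str, retriever: Retriever) -> str:
    # The pipeline depends only on the interface, not the implementation
    context = "\n".join(retriever.retrieve(query, k=2))
    return f"Context: {context}\n\nQuestion: {query}"
```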

GraphRAG

Microsoft's GraphRAG builds a knowledge graph from documents, extracting entities and relationships, then uses graph traversal combined with vector search. This captures hierarchical context and entity connections that flat vector search misses, excelling on complex analytical queries.
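A minimal sketch of the graph-expansion idea (the entity graph and `expand` helper below are illustrative, not Microsoft's implementation): entities mentioned in the top vector-search hits seed a traversal that pulls in related entities, and chunks mentioning those entities join the context.

```python
# Hypothetical entity graph extracted from documents: entity -> neighbors
graph = {
    "GraphRAG": ["knowledge graph", "vector search"],
    "knowledge graph": ["entities", "relationships"],
    "vector search": ["embeddings"],
    "entities": [],
    "relationships": [],
    "embeddings": [],
}

def expand(seed_entities, hops=1):
    """Return all entities within `hops` traversal steps of the seeds."""
    seen = set(seed_entities)
    frontier = set(seed_entities)
    for _ in range(hops):
        # Step one hop outward, keeping only newly discovered entities
        frontier = {n for e in frontier for n in graph.get(e, [])} - seen
        seen |= frontier
    return seen

# Entities found in the top vector-search hits would seed the traversal;
# chunks mentioning any expanded entity are then added to the context.
```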

Chunking Strategies

Strategy | Method | Best For
---------|--------|---------
Fixed-size | Split by token/character count with overlap | Simple documents, fast implementation
Recursive | Hierarchical split (paragraphs, sentences, words) | Structured text with natural boundaries
Semantic | Group by embedding similarity | Topic-coherent chunks, mixed documents
Contextual | Prepend document-level context to each chunk | Preserving source context in retrieval
Agentic | LLM decides chunk boundaries | Complex documents requiring judgment
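The first row of the table — fixed-size splitting with overlap — fits in a few lines (character-based here for simplicity; real splitters usually count tokens):

```python
def chunk_fixed(text, size=100, overlap=20):
    """Split text into fixed-size character chunks sharing `overlap` chars."""
    step = size - overlap
    # Stop once the remaining tail is already covered by the previous chunk
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_fixed("a" * 250, size=100, overlap=20)
# Adjacent chunks share 20 characters, so content that falls on a
# boundary appears in both neighbors.
```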

Example: Advanced RAG Pipeline

from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_cohere import CohereRerank
from langchain.retrievers import ContextualCompressionRetriever
 
# 1. Chunk documents with recursive splitting
#    (`documents` is assumed to be a list of already-loaded Document objects)
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512, chunk_overlap=64,
    separators=["\n\n", "\n", ". ", " "]
)
chunks = splitter.split_documents(documents)
 
# 2. Embed and store in vector database
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vectorstore = Chroma.from_documents(chunks, embeddings)
 
# 3. Retrieve broadly with dense search, then rerank down to the best 5
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})
reranker = CohereRerank(top_n=5)
retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=base_retriever
)
 
# 4. Generate with retrieved context
llm = ChatOpenAI(model="gpt-4")
docs = retriever.invoke("How does GraphRAG improve retrieval?")
context = "\n".join(doc.page_content for doc in docs)
response = llm.invoke(f"Context: {context}\n\nQuestion: How does GraphRAG improve retrieval?")

Evaluation with RAGAS

RAGAS (Retrieval Augmented Generation Assessment) provides standard metrics for evaluating RAG pipelines, including faithfulness, answer relevancy, context precision, and context recall.
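To give a flavor of what these metrics compute, here is an illustrative re-implementation of the idea behind context precision — rank-weighted precision over the retrieved chunks. The real library derives the 0/1 relevance judgments with an LLM; this sketch takes them as input:

```python
def context_precision(relevance):
    """Rank-weighted precision over retrieved chunks.

    `relevance` lists 0/1 judgments for chunks in retrieval order.
    Illustrative re-implementation of the metric's core idea,
    not the RAGAS library's code.
    """
    hits, score = 0, 0.0
    for i, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            score += hits / i  # precision@i at each relevant position
    return score / hits if hits else 0.0
```

Relevant chunks ranked early score higher: `[1, 0, 1]` averages precision@1 (1.0) and precision@3 (2/3), rewarding retrievers that surface relevant context first.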
