Retrieval Augmented Generation

Retrieval Augmented Generation (RAG) enhances large language models by retrieving relevant external documents at query time, grounding responses in factual, up-to-date information without retraining the model. RAG addresses core LLM limitations — hallucinations, outdated knowledge, and lack of domain-specific data — making it one of the most widely deployed patterns for production AI agent systems.

Core RAG Pipeline

RAG operates in three stages:

  1. Retrieval — A query is embedded into a vector $\mathbf{q} = E(\text{query})$ and used to search a knowledge base (vector database, keyword index, or hybrid) for the top-$k$ relevant document chunks by similarity $\text{sim}(\mathbf{q}, \mathbf{d}_i)$
  2. Augmentation — Retrieved chunks are injected into the LLM prompt alongside the user query to provide grounding context
  3. Generation — The LLM synthesizes a response $P(\text{answer} \mid \text{query}, d_1, d_2, \ldots, d_k)$ using both its training knowledge and the retrieved context
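The three stages can be sketched end to end with a toy bag-of-words embedding standing in for a real embedding model $E$ (the `embed` function, vocabulary, and sample corpus below are all illustrative, not production components):

```python
import math
import re

def embed(text, vocab):
    # Toy bag-of-words embedding: a stand-in for a real model E(text)
    tokens = re.findall(r"\w+", text.lower())
    return [tokens.count(w) for w in vocab]

def cosine(q, d):
    # sim(q, d) = (q . d) / (||q|| * ||d||)
    dot = sum(a * b for a, b in zip(q, d))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in d))
    return dot / norm if norm else 0.0

corpus = [
    "RAG retrieves documents to ground LLM answers.",
    "Transformers use self-attention over token sequences.",
    "Vector databases index embeddings for similarity search.",
]
vocab = sorted({w for doc in corpus for w in re.findall(r"\w+", doc.lower())})

def retrieve(query, k=2):
    # Stage 1: embed the query, rank chunks by similarity, keep top-k
    q = embed(query, vocab)
    return sorted(corpus, key=lambda d: cosine(q, embed(d, vocab)), reverse=True)[:k]

def augment(query, chunks):
    # Stage 2: inject retrieved chunks into the prompt as grounding context
    return "Context:\n" + "\n".join(chunks) + f"\n\nQuestion: {query}"

prompt = augment("How does RAG ground answers?", retrieve("How does RAG ground answers?"))
# Stage 3 (generation) would pass `prompt` to an LLM.
```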

RAG Variants

Naive RAG

The simplest implementation: embed query, retrieve top-$k$ chunks by cosine similarity $\frac{\mathbf{q} \cdot \mathbf{d}}{||\mathbf{q}|| \cdot ||\mathbf{d}||}$, stuff into prompt, generate. Prone to retrieval noise, irrelevant chunks, and context overflow on complex queries.

Advanced RAG

Optimizes each stage of the pipeline: query rewriting and expansion before retrieval, hybrid (dense plus keyword) search during retrieval, and reranking and context compression after retrieval. The code example later in this article demonstrates the retrieve-then-rerank pattern.

Modular RAG

Breaks the pipeline into interchangeable components — retrieval, reranking, memory, and routing modules — that can be swapped or recombined for maximum flexibility.
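As a sketch of what "interchangeable" means in practice, the retrieval stage can sit behind a small interface so that dense, keyword, or graph retrievers slot in without touching the rest of the pipeline (the `Retriever` protocol and `KeywordRetriever` class below are hypothetical illustrations, not a library API):

```python
from typing import List, Protocol

class Retriever(Protocol):
    # Any module implementing this signature can be plugged in
    def retrieve(self, query: str, k: int) -> List[str]: ...

class KeywordRetriever:
    # Simple term-overlap retriever; a vector or graph retriever
    # with the same signature could replace it transparently
    def __init__(self, docs: List[str]):
        self.docs = docs

    def retrieve(self, query: str, k: int) -> List[str]:
        terms = set(query.lower().split())
        scored = sorted(self.docs,
                        key=lambda d: len(terms & set(d.lower().split())),
                        reverse=True)
        return scored[:k]

def build_prompt(query: str, retriever: Retriever) -> str:
    # The pipeline depends only on the interface, not the implementation
    context = "\n".join(retriever.retrieve(query, k=2))
    return f"Context: {context}\n\nQuestion: {query}"
```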

GraphRAG

Microsoft's GraphRAG builds a knowledge graph from documents, extracting entities and relationships, then uses graph traversal combined with vector search. This captures hierarchical context and entity connections that flat vector search misses, excelling on complex analytical queries.
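A minimal sketch of the graph-expansion idea (the entity graph and `expand` helper below are illustrative, not Microsoft's implementation): entities mentioned in the top vector-search hits seed a traversal that pulls in related entities, and chunks mentioning those entities join the context.

```python
# Hypothetical entity graph extracted from documents: entity -> neighbors
graph = {
    "GraphRAG": ["knowledge graph", "vector search"],
    "knowledge graph": ["entities", "relationships"],
    "vector search": ["embeddings"],
    "entities": [],
    "relationships": [],
    "embeddings": [],
}

def expand(seed_entities, hops=1):
    """Return all entities within `hops` traversal steps of the seeds."""
    seen = set(seed_entities)
    frontier = set(seed_entities)
    for _ in range(hops):
        # Step one hop outward, keeping only newly discovered entities
        frontier = {n for e in frontier for n in graph.get(e, [])} - seen
        seen |= frontier
    return seen

# Entities found in the top vector-search hits would seed the traversal;
# chunks mentioning any expanded entity are then added to the context.
```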

Chunking Strategies

Strategy | Method | Best For
---------|--------|---------
Fixed-size | Split by token/character count with overlap | Simple documents, fast implementation
Recursive | Hierarchical split (paragraphs, sentences, words) | Structured text with natural boundaries
Semantic | Group by embedding similarity | Topic-coherent chunks, mixed documents
Contextual | Prepend document-level context to each chunk | Preserving source context in retrieval
Agentic | LLM decides chunk boundaries | Complex documents requiring judgment
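The first row of the table — fixed-size splitting with overlap — fits in a few lines (character-based here for simplicity; real splitters usually count tokens):

```python
def chunk_fixed(text, size=100, overlap=20):
    """Split text into fixed-size character chunks sharing `overlap` chars."""
    step = size - overlap
    # Stop once the remaining tail is already covered by the previous chunk
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_fixed("a" * 250, size=100, overlap=20)
# Adjacent chunks share 20 characters, so content that falls on a
# boundary appears in both neighbors.
```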

Example: Advanced RAG Pipeline

from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_cohere import CohereRerank
from langchain.retrievers import ContextualCompressionRetriever
 
# 1. Chunk documents with recursive splitting
#    (`documents` is assumed to be a list of already-loaded Document objects)
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512, chunk_overlap=64,
    separators=["\n\n", "\n", ". ", " "]
)
chunks = splitter.split_documents(documents)
 
# 2. Embed and store in vector database
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vectorstore = Chroma.from_documents(chunks, embeddings)
 
# 3. Retrieve broadly with dense search, then rerank down to the best 5
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})
reranker = CohereRerank(top_n=5)
retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=base_retriever
)
 
# 4. Generate with retrieved context
llm = ChatOpenAI(model="gpt-4")
docs = retriever.invoke("How does GraphRAG improve retrieval?")
context = "\n".join(doc.page_content for doc in docs)
response = llm.invoke(f"Context: {context}\n\nQuestion: How does GraphRAG improve retrieval?")

Evaluation with RAGAS

RAGAS (Retrieval Augmented Generation Assessment) provides standard metrics for evaluating RAG pipelines, including faithfulness, answer relevancy, context precision, and context recall.
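To give a flavor of what these metrics compute, here is an illustrative re-implementation of the idea behind context precision — rank-weighted precision over the retrieved chunks. The real library derives the 0/1 relevance judgments with an LLM; this sketch takes them as input:

```python
def context_precision(relevance):
    """Rank-weighted precision over retrieved chunks.

    `relevance` lists 0/1 judgments for chunks in retrieval order.
    Illustrative re-implementation of the metric's core idea,
    not the RAGAS library's code.
    """
    hits, score = 0, 0.0
    for i, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            score += hits / i  # precision@i at each relevant position
    return score / hits if hits else 0.0
```

Relevant chunks ranked early score higher: `[1, 0, 1]` averages precision@1 (1.0) and precision@3 (2/3), rewarding retrievers that surface relevant context first.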
