Core Concepts
Reasoning Techniques
Memory Systems
Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools & Products
Safety & Governance
Evaluation
Research
Development
Meta
Retrieval Augmented Generation (RAG) enhances large language models by retrieving relevant external documents at query time, grounding responses in factual, up-to-date information without retraining the model. RAG addresses core LLM limitations — hallucinations, outdated knowledge, and lack of domain-specific data — making it the most widely deployed pattern for production AI agent systems.
RAG operates in three stages: indexing (documents are chunked and embedded into a vector store), retrieval (the chunks most relevant to a query are fetched), and generation (the LLM answers using the retrieved context).
Naive RAG is the simplest implementation: embed the query, retrieve the top-$k$ chunks by cosine similarity $\frac{\mathbf{q} \cdot \mathbf{d}}{||\mathbf{q}|| \cdot ||\mathbf{d}||}$, stuff them into the prompt, and generate. It is prone to retrieval noise, irrelevant chunks, and context overflow on complex queries.
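The naive retrieval step can be sketched with plain NumPy (helper names are illustrative; a production system would use a vector database rather than a linear scan):

```python
import numpy as np

def cosine_similarity(q: np.ndarray, d: np.ndarray) -> float:
    """Cosine similarity q.d / (||q|| * ||d||)."""
    return float(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)))

def top_k(query_vec: np.ndarray, doc_vecs: list, k: int = 3) -> list:
    """Indices of the k document vectors most similar to the query."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```

The retrieved chunks are then concatenated into the prompt, which is exactly where the noted failure modes (noise, irrelevance, overflow) enter.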
Advanced RAG optimizes each stage of the pipeline:
Modular RAG breaks the pipeline into interchangeable components for maximum flexibility:
Microsoft's GraphRAG builds a knowledge graph from documents, extracting entities and relationships, then uses graph traversal combined with vector search. This captures hierarchical context and entity connections that flat vector search misses, excelling on complex analytical queries.
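As a toy illustration of the idea (this is not Microsoft's implementation; the graph, the `mentions` index, and all names are invented for this sketch), graph expansion around query entities can surface chunks that a flat similarity search over the query text alone would miss:

```python
# Entity -> related entities (edges extracted from documents)
graph = {
    "GraphRAG": {"knowledge graph", "vector search"},
    "knowledge graph": {"entities", "relationships"},
    "vector search": {"embeddings"},
}

# Entity -> ids of chunks that mention it
mentions = {
    "GraphRAG": {0}, "knowledge graph": {1}, "vector search": {2},
    "entities": {1}, "relationships": {1}, "embeddings": {2},
}

def expand(entities: set, hops: int = 1) -> set:
    """Collect entities reachable within `hops` graph steps."""
    frontier, seen = set(entities), set(entities)
    for _ in range(hops):
        frontier = {n for e in frontier for n in graph.get(e, ())} - seen
        seen |= frontier
    return seen

def graph_retrieve(query_entities: set, hops: int = 1) -> set:
    """Union of chunks mentioning the query entities or their neighbors."""
    chunks = set()
    for e in expand(query_entities, hops):
        chunks |= mentions.get(e, set())
    return chunks
```

In the real system these graph-derived candidates are merged with vector-search hits and community summaries before generation.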
| Strategy | Method | Best For |
| --- | --- | --- |
| Fixed-size | Split by token/character count with overlap | Simple documents, fast implementation |
| Recursive | Hierarchical split (paragraphs, sentences, words) | Structured text with natural boundaries |
| Semantic | Group by embedding similarity | Topic-coherent chunks, mixed documents |
| Contextual | Prepend document-level context to each chunk | Preserving source context in retrieval |
| Agentic | LLM decides chunk boundaries | Complex documents requiring judgment |
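The first row of the table, fixed-size chunking with overlap, is simple enough to write by hand (a minimal character-based sketch; real splitters count tokens and respect separators):

```python
def chunk_fixed(text: str, size: int = 512, overlap: int = 64) -> list:
    """Fixed-size character chunks; consecutive chunks share `overlap` chars."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

The overlap keeps sentences that straddle a boundary retrievable from at least one chunk, at the cost of some duplicated storage.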
```python
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_cohere import CohereRerank
from langchain.retrievers import ContextualCompressionRetriever

# 1. Chunk documents with recursive splitting
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ". ", " "],
)
chunks = splitter.split_documents(documents)

# 2. Embed and store in vector database
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vectorstore = Chroma.from_documents(chunks, embeddings)

# 3. Dense retrieval with Cohere reranking
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})
reranker = CohereRerank(top_n=5)
retriever = ContextualCompressionRetriever(
    base_compressor=reranker, base_retriever=base_retriever
)

# 4. Generate with retrieved context
llm = ChatOpenAI(model="gpt-4")
question = "How does GraphRAG improve retrieval?"
docs = retriever.invoke(question)
context = "\n".join(doc.page_content for doc in docs)
response = llm.invoke(f"Context: {context}\n\nQuestion: {question}")
```
RAGAS (Retrieval Augmented Generation Assessment) provides standard metrics, such as faithfulness, answer relevancy, context precision, and context recall, for evaluating RAG pipelines:
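RAGAS scores these metrics with LLM judges; as a rough, judgment-free illustration of what a context-precision-style metric measures, a version using ground-truth relevance labels might look like (function name and labeling scheme are assumptions for this sketch, not the Ragas API):

```python
def context_precision(retrieved: list, relevant: set) -> float:
    """Mean precision@k over the ranks where a relevant chunk appears,
    rewarding pipelines that rank relevant context near the top."""
    hits, precisions = 0, []
    for k, chunk in enumerate(retrieved, start=1):
        if chunk in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0
```

A pipeline that retrieves the same relevant chunks but ranks them lower scores worse, which is the behavior the ranked metric is designed to penalize.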