====== Retrieval Augmented Generation ======

Retrieval Augmented Generation (RAG) enhances large language models by retrieving relevant external documents at query time, grounding responses in factual, up-to-date information without retraining the model. RAG addresses core LLM limitations — hallucinations, outdated knowledge, and lack of domain-specific data — making it one of the most widely deployed patterns for production AI agent systems.

===== Core RAG Pipeline =====

RAG operates in three stages:

  - **Retrieval** — A query is embedded into a vector $\mathbf{q} = E(\text{query})$ and used to search a knowledge base (vector database, keyword index, or hybrid) for the top-$k$ most relevant document chunks by similarity $\text{sim}(\mathbf{q}, \mathbf{d}_i)$
  - **Augmentation** — Retrieved chunks are injected into the LLM prompt alongside the user query to provide grounding context
  - **Generation** — The LLM synthesizes a response $P(\text{answer} \mid \text{query}, d_1, d_2, \ldots, d_k)$ using both its training knowledge and the retrieved context

===== RAG Variants =====

==== Naive RAG ====

The simplest implementation: embed the query, retrieve the top-$k$ chunks by cosine similarity $\frac{\mathbf{q} \cdot \mathbf{d}}{||\mathbf{q}|| \cdot ||\mathbf{d}||}$, stuff them into the prompt, and generate. Prone to retrieval noise, irrelevant chunks, and context overflow on complex queries.
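The naive pipeline's retrieval step can be sketched end to end with toy vectors (stand-ins for real embeddings; the helper names here are hypothetical, not from any library):

```python
import math

def cosine_sim(q, d):
    # sim(q, d) = q . d / (||q|| * ||d||)
    dot = sum(a * b for a, b in zip(q, d))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_d = math.sqrt(sum(b * b for b in d))
    return dot / (norm_q * norm_d)

def retrieve_top_k(query_vec, doc_vecs, k=2):
    # rank all chunks by similarity to the query, keep the k best indices
    scores = [cosine_sim(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)[:k]

# toy 3-dimensional "embeddings" standing in for E(chunk)
docs = [
    [1.0, 0.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
]
query = [0.9, 0.1, 0.0]
print(retrieve_top_k(query, docs, k=2))  # → [0, 1]
```

In a real system the vectors come from an embedding model and the ranking is delegated to a vector database index rather than a linear scan.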
==== Advanced RAG ====

Optimizes each stage of the pipeline:

  * **Pre-retrieval** — Query rewriting (HyDE, ITER-RETGEN), query expansion, and decomposition for complex questions
  * **Retrieval** — Hybrid search combining semantic vectors with BM25 keyword matching (which scores via $\text{BM25}(q, d) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{f(t,d) \cdot (k_1 + 1)}{f(t,d) + k_1 \cdot (1 - b + b \cdot \frac{|d|}{\text{avgdl}})}$), plus fine-tuned embedding models
  * **Post-retrieval** — Reranking retrieved results (Cohere Rerank, cross-encoders), context compression, and deduplication

==== Modular RAG ====

Breaks the pipeline into interchangeable components for maximum flexibility:

  * **Iterative retrieval** — Multiple retrieval rounds that refine results (RETRO, GAR-meets-RAG)
  * **Recursive retrieval** — Multi-hop reasoning for questions requiring chain-of-thought across documents (IRCoT)
  * **Adaptive retrieval** — The system decides when retrieval is needed versus when the LLM can answer directly (Self-RAG, FLARE)

==== GraphRAG ====

[[https://microsoft.github.io/graphrag/|Microsoft's GraphRAG]] builds a knowledge graph from documents by extracting entities and relationships, then uses graph traversal combined with vector search. This captures hierarchical context and entity connections that flat vector search misses, excelling on complex analytical queries.
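The BM25 formula above can be implemented directly. A minimal sketch on a toy corpus, assuming whitespace tokenization and the common smoothed-IDF variant (the function and corpus here are illustrative, not a library API):

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a query with Okapi BM25."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in corpus if t in d)             # document frequency of t
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)   # smoothed IDF(t)
        f = doc.count(t)                                  # term frequency f(t, d)
        denom = f + k1 * (1 - b + b * len(doc) / avgdl)   # length-normalized damping
        score += idf * f * (k1 + 1) / denom
    return score

corpus = [
    "rag grounds llm responses in retrieved context".split(),
    "bm25 ranks documents by keyword overlap".split(),
    "vector search uses embedding similarity".split(),
]
query = "bm25 keyword ranking".split()
scores = [bm25_score(query, d, corpus) for d in corpus]
print(max(range(len(corpus)), key=scores.__getitem__))  # → 1 (the BM25 document)
```

Hybrid search typically fuses these keyword scores with vector-similarity ranks, e.g. via reciprocal rank fusion.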
===== Chunking Strategies =====

^ Strategy ^ Method ^ Best For ^
| Fixed-size | Split by token/character count with overlap | Simple documents, fast implementation |
| Recursive | Hierarchical split (paragraphs, sentences, words) | Structured text with natural boundaries |
| Semantic | Group by embedding similarity | Topic-coherent chunks, mixed documents |
| Contextual | Prepend document-level context to each chunk | Preserving source context in retrieval |
| Agentic | LLM decides chunk boundaries | Complex documents requiring judgment |

===== Example: Advanced RAG Pipeline =====

<code python>
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_cohere import CohereRerank
from langchain.retrievers import ContextualCompressionRetriever

# 1. Chunk documents with recursive splitting
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ". ", " "],
)
chunks = splitter.split_documents(documents)

# 2. Embed and store in a vector database
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vectorstore = Chroma.from_documents(chunks, embeddings)

# 3. Retrieve broadly, then rerank down to the best chunks
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})
reranker = CohereRerank(top_n=5)
retriever = ContextualCompressionRetriever(
    base_compressor=reranker, base_retriever=base_retriever
)

# 4. Generate with retrieved context
llm = ChatOpenAI(model="gpt-4")
question = "How does GraphRAG improve retrieval?"
docs = retriever.invoke(question)
context = "\n".join(doc.page_content for doc in docs)
response = llm.invoke(f"Context: {context}\n\nQuestion: {question}")
</code>

===== Evaluation with RAGAS =====

[[https://docs.ragas.io/|RAGAS]] (Retrieval Augmented Generation Assessment) provides standard metrics for evaluating RAG pipelines:

  * **Faithfulness** — Are generated claims supported by the retrieved context? Measured as $\frac{|\text{supported claims}|}{|\text{total claims}|}$
  * **Answer relevance** — Does the response address the actual question?
  * **Context precision** — How much of the retrieved context is relevant? $\text{Precision@}k = \frac{|\text{relevant chunks in top-}k|}{k}$
  * **Context recall** — Were all necessary documents retrieved? $\text{Recall} = \frac{|\text{relevant chunks retrieved}|}{|\text{total relevant chunks}|}$

===== References =====

  * [[https://www.promptingguide.ai/research/rag|Prompting Guide - RAG Research]]
  * [[https://microsoft.github.io/graphrag/|Microsoft GraphRAG]]
  * [[https://docs.ragas.io/|RAGAS Evaluation Framework]]
  * [[https://en.wikipedia.org/wiki/Retrieval-augmented_generation|Wikipedia - Retrieval Augmented Generation]]

===== See Also =====

  * [[embeddings]] — Embedding models that power RAG retrieval
  * [[knowledge_graphs]] — Graph-based retrieval with GraphRAG
  * [[agent_memory_frameworks]] — Memory systems that build on RAG patterns
  * [[vector_databases]] — Storage infrastructure for RAG
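The context precision and recall formulas from the RAGAS metrics can be checked by hand. A minimal sketch with hypothetical chunk IDs (not tied to the RAGAS library API):

```python
def precision_at_k(retrieved, relevant, k):
    # fraction of the top-k retrieved chunks that are actually relevant
    top_k = retrieved[:k]
    return sum(1 for c in top_k if c in relevant) / k

def recall(retrieved, relevant):
    # fraction of all relevant chunks that were retrieved at all
    return sum(1 for c in relevant if c in retrieved) / len(relevant)

retrieved = ["c1", "c4", "c2", "c9", "c7"]  # ranked chunk IDs from the retriever
relevant = {"c1", "c2", "c3"}               # ground-truth relevant chunks

print(precision_at_k(retrieved, relevant, 5))  # 2 of the top 5 are relevant → 0.4
print(recall(retrieved, relevant))             # 2 of 3 relevant chunks found → 2/3
```

In practice RAGAS estimates the "relevant" sets with an LLM judge rather than requiring hand-labeled ground truth.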