====== Retrieval Augmented Generation ======

Retrieval Augmented Generation (RAG) enhances large language models by retrieving relevant external documents at query time, grounding responses in factual, up-to-date information without retraining the model.((https://en.wikipedia.org/wiki/Retrieval-augmented_generation)) RAG addresses core LLM limitations — hallucinations, outdated knowledge, and lack of domain-specific data — making it the most widely deployed pattern for production AI agent systems.(([[https://www.promptingguide.ai/research/rag|Prompting Guide - RAG Research]])) A key challenge in production RAG systems is that they frequently return incorrect answers with high confidence, requiring careful evaluation and feedback loops to ensure reliability.(([[https://www.bensbites.com/p/my-cheatsheet-for-a-clean-context|Ben's Bites - Cheatsheet for Clean Context (2026)]])) When RAG data becomes massive, noisy, or contradictory in ways that break coherent context windows, this limitation can justify transitioning to multi-agent systems designed to filter and structure the retrieved information before generation.(([[https://alphasignalai.substack.com/p/how-to-choose-between-single-and|AlphaSignal - How to Choose Between Single and Multi-Agent Systems (2026)]]))

===== Core RAG Pipeline =====

RAG operates in three stages:

  - **Retrieval** — A query is embedded into a vector $\mathbf{q} = E(\text{query})$ and used to search a knowledge base (vector database, keyword index, or hybrid) for the top-$k$ relevant document chunks by similarity $\text{sim}(\mathbf{q}, \mathbf{d}_i)$
  - **Augmentation** — Retrieved chunks are injected into the LLM prompt alongside the user query to provide grounding context
  - **Generation** — The LLM synthesizes a response $P(\text{answer} \mid \text{query}, d_1, d_2, \ldots, d_k)$ using both its training knowledge and the retrieved context

===== RAG Variants =====

==== Naive RAG ====

The simplest implementation: embed the query, retrieve the top-$k$ chunks by cosine similarity $\frac{\mathbf{q} \cdot \mathbf{d}}{||\mathbf{q}|| \cdot ||\mathbf{d}||}$, stuff them into the prompt, and generate. Prone to retrieval noise, irrelevant chunks, and context overflow on complex queries.
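The following is a minimal sketch of that naive loop in plain Python, assuming the official ''openai'' client and NumPy; the model names, tiny corpus, and prompt format are illustrative assumptions, not anything prescribed by the sources above.

<code python>
# Minimal naive RAG: embed, retrieve by cosine similarity, stuff, generate.
# Assumes OPENAI_API_KEY is set; model names and corpus are illustrative.
import numpy as np
from openai import OpenAI

client = OpenAI()
corpus = [
    "GraphRAG builds a knowledge graph from documents before retrieval.",
    "BM25 is a keyword-based ranking function used in hybrid search.",
    "RAGAS measures faithfulness, relevance, precision, and recall.",
]

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

doc_vecs = embed(corpus)

def retrieve(query, k=2):
    q = embed([query])[0]
    # cosine similarity: (q . d) / (||q|| * ||d||) for every document
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    return [corpus[i] for i in np.argsort(sims)[::-1][:k]]

query = "What does GraphRAG do?"
context = "\n".join(retrieve(query))
answer = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}],
).choices[0].message.content
print(answer)
</code>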
==== Advanced RAG ====

Optimizes each stage of the pipeline:((https://www.promptingguide.ai/research/rag))

  * **Pre-retrieval** — Query rewriting (HyDE, ITER-RETGEN), query expansion, and decomposition for complex questions
  * **Retrieval** — [[hybrid_search|Hybrid search]] combining semantic vectors with BM25 keyword matching (which scores via $\text{BM25}(q, d) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{f(t,d) \cdot (k_1 + 1)}{f(t,d) + k_1 \cdot (1 - b + b \cdot \frac{|d|}{\text{avgdl}})}$), plus fine-tuned embedding models
  * **Post-retrieval** — [[reranking|Reranking]] retrieved results ([[cohere|Cohere]] Rerank, cross-encoders), context compression, and deduplication

==== Modular RAG ====

Breaks the pipeline into interchangeable components for maximum flexibility:

  * **Iterative retrieval** — Multiple retrieval rounds that refine results (RETRO, GAR-meets-RAG)
  * **Recursive retrieval** — Multi-hop reasoning for questions requiring chain-of-thought across documents (IRCoT)
  * **Adaptive retrieval** — The system decides when retrieval is needed versus when the LLM can answer directly (Self-RAG, FLARE)

==== GraphRAG ====

[[https://microsoft.github.io/graphrag/|Microsoft's GraphRAG]]((https://microsoft.github.io/graphrag/)) builds a knowledge graph from documents, extracting entities and relationships, then uses graph traversal combined with vector search. This captures hierarchical context and entity connections that flat vector search misses, excelling on complex analytical queries.

===== Chunking Strategies =====

| **Strategy** | **Method** | **Best For** |
| Fixed-size | Split by token/character count with overlap | Simple documents, fast implementation |
| Recursive | Hierarchical split (paragraphs, sentences, words) | Structured text with natural boundaries |
| Semantic | Group by embedding similarity | Topic-coherent chunks, mixed documents |
| Contextual | Prepend document-level context to each chunk | Preserving source context in retrieval |
| Agentic | LLM decides chunk boundaries | Complex documents requiring judgment |
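As a minimal illustration of the first strategy in the table, the sketch below implements fixed-size chunking with overlap in plain Python. The window and overlap sizes are arbitrary assumptions, and production splitters typically count tokens rather than characters; the overlap exists so that a sentence straddling a boundary remains retrievable from at least one chunk.

<code python>
# Fixed-size chunking with overlap (character-based for simplicity;
# production splitters typically count tokens instead).
def chunk_fixed(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap  # step forward, keeping `overlap` chars of context
    return chunks

doc = "RAG retrieves external documents at query time. " * 40
chunks = chunk_fixed(doc, size=200, overlap=40)
print(len(chunks), len(chunks[0]))  # 13 chunks of up to 200 characters
</code>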
===== Example: Advanced RAG Pipeline =====

<code python>
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_cohere import CohereRerank
from langchain.retrievers import ContextualCompressionRetriever

# 1. Chunk documents with recursive splitting
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ". ", " "]
)
chunks = splitter.split_documents(documents)  # `documents` loaded earlier

# 2. Embed and store in a vector database
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vectorstore = Chroma.from_documents(chunks, embeddings)

# 3. Vector retrieval with reranking: fetch 20 candidates by similarity,
#    then keep the 5 best after cross-encoder reranking
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})
reranker = CohereRerank(top_n=5)
retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=base_retriever
)

# 4. Generate with retrieved context
llm = ChatOpenAI(model="gpt-4")
docs = retriever.invoke("How does GraphRAG improve retrieval?")
context = "\n".join(doc.page_content for doc in docs)
response = llm.invoke(f"Context: {context}\n\nQuestion: How does GraphRAG improve retrieval?")
</code>

===== Evaluation with RAGAS =====

[[https://docs.ragas.io/|RAGAS]](([[https://docs.ragas.io/|RAGAS Evaluation Framework]])) (Retrieval Augmented Generation Assessment) provides standard metrics for evaluating RAG pipelines:

  * **Faithfulness** — Are generated claims supported by retrieved context? Measured as $\frac{|\text{supported claims}|}{|\text{total claims}|}$
  * **Answer relevance** — Does the response address the actual question?
  * **Context precision** — How much of the retrieved context is relevant? $\text{Precision@}k = \frac{|\text{relevant chunks in top-}k|}{k}$ (e.g., 4 relevant chunks in the top 5 gives a precision of $0.8$)
  * **Context recall** — Were all necessary documents retrieved? $\text{Recall} = \frac{|\text{relevant chunks retrieved}|}{|\text{total relevant chunks}|}$

===== See Also =====

  * [[rag_in_ai|Retrieval-Augmented Generation (RAG) in AI]]
  * [[how_to_build_a_rag_pipeline|How to Build a RAG Pipeline]]
  * [[rag_system_production_deployment|RAG System Production Deployment]]
  * [[retrieval_strategies|Retrieval Strategies]]
  * [[rag_phases|Phases of a RAG System]]

===== References =====