====== Retrieval Augmented Generation ======

Retrieval Augmented Generation (RAG) enhances large language models by retrieving relevant external documents at query time, grounding responses in factual, up-to-date information without retraining the model.((https://en.wikipedia.org/wiki/Retrieval-augmented_generation)) RAG addresses core LLM limitations — hallucinations, outdated knowledge, and lack of domain-specific data — making it the most widely deployed pattern for production AI agent systems.(([[https://www.promptingguide.ai/research/rag|Prompting Guide - RAG Research]])) A key challenge in production RAG systems is that they frequently return incorrect answers with high confidence, requiring careful evaluation and feedback loops to ensure reliability.(([[https://www.bensbites.com/p/my-cheatsheet-for-a-clean-context|Ben's Bites - Cheatsheet for Clean Context (2026)]])) When RAG data becomes massive, noisy, or contradictory in ways that break coherent context windows, this limitation can justify transitioning to multi-agent systems designed to filter and structure the retrieved information before generation.(([[https://alphasignalai.substack.com/p/how-to-choose-between-single-and|AlphaSignal - How to Choose Between Single and Multi-Agent Systems (2026)]]))

===== Core RAG Pipeline =====

RAG operates in three stages:

  - **Retrieval** — A query is embedded into a vector $\mathbf{q} = E(\text{query})$ and used to search a knowledge base (vector database, keyword index, or hybrid) for the top-$k$ relevant document chunks by similarity $\text{sim}(\mathbf{q}, \mathbf{d}_i)$
  - **Augmentation** — Retrieved chunks are injected into the LLM prompt alongside the user query to provide grounding context
  - **Generation** — The LLM synthesizes a response $P(\text{answer} \mid \text{query}, d_1, d_2, \ldots, d_k)$ using both its training knowledge and the retrieved context

===== RAG Variants =====

==== Naive RAG ====

The simplest implementation: embed the query, retrieve the top-$k$ chunks by cosine similarity $\frac{\mathbf{q} \cdot \mathbf{d}}{||\mathbf{q}|| \cdot ||\mathbf{d}||}$, stuff them into the prompt, and generate. Prone to retrieval noise, irrelevant chunks, and context overflow on complex queries.
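The following is a minimal sketch of that naive loop in plain Python, assuming the official ''openai'' client and NumPy; the model names, tiny corpus, and prompt format are illustrative assumptions, not anything prescribed by the sources above.

<code python>
# Minimal naive RAG: embed, retrieve by cosine similarity, stuff, generate.
# Assumes OPENAI_API_KEY is set; model names and corpus are illustrative.
import numpy as np
from openai import OpenAI

client = OpenAI()
corpus = [
    "GraphRAG builds a knowledge graph from documents before retrieval.",
    "BM25 is a keyword-based ranking function used in hybrid search.",
    "RAGAS measures faithfulness, relevance, precision, and recall.",
]

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

doc_vecs = embed(corpus)

def retrieve(query, k=2):
    q = embed([query])[0]
    # cosine similarity: (q . d) / (||q|| * ||d||) for every document
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    return [corpus[i] for i in np.argsort(sims)[::-1][:k]]

query = "What does GraphRAG do?"
context = "\n".join(retrieve(query))
answer = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}],
).choices[0].message.content
print(answer)
</code>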
==== Advanced RAG ====

Optimizes each stage of the pipeline:((https://www.promptingguide.ai/research/rag))

  * **Pre-retrieval** — Query rewriting (HyDE, ITER-RETGEN), query expansion, and decomposition for complex questions
  * **Retrieval** — [[hybrid_search|Hybrid search]] combining semantic vectors with BM25 keyword matching (which scores via $\text{BM25}(q, d) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{f(t,d) \cdot (k_1 + 1)}{f(t,d) + k_1 \cdot (1 - b + b \cdot \frac{|d|}{\text{avgdl}})}$), plus fine-tuned embedding models
  * **Post-retrieval** — [[reranking|Reranking]] retrieved results ([[cohere|Cohere]] Rerank, cross-encoders), context compression, and deduplication

==== Modular RAG ====

Breaks the pipeline into interchangeable components for maximum flexibility:

  * **Iterative retrieval** — Multiple retrieval rounds that refine results (RETRO, GAR-meets-RAG)
  * **Recursive retrieval** — Multi-hop reasoning for questions requiring chain-of-thought across documents (IRCoT)
  * **Adaptive retrieval** — The system decides when retrieval is needed versus when the LLM can answer directly (Self-RAG, FLARE)

==== GraphRAG ====

[[https://microsoft.github.io/graphrag/|Microsoft's GraphRAG]]((https://microsoft.github.io/graphrag/)) builds a knowledge graph from documents, extracting entities and relationships, then uses graph traversal combined with vector search. This captures hierarchical context and entity connections that flat vector search misses, excelling on complex analytical queries.

===== Chunking Strategies =====

| **Strategy** | **Method** | **Best For** |
| Fixed-size | Split by token/character count with overlap | Simple documents, fast implementation |
| Recursive | Hierarchical split (paragraphs, sentences, words) | Structured text with natural boundaries |
| Semantic | Group by embedding similarity | Topic-coherent chunks, mixed documents |
| Contextual | Prepend document-level context to each chunk | Preserving source context in retrieval |
| Agentic | LLM decides chunk boundaries | Complex documents requiring judgment |
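As a minimal illustration of the first strategy in the table, the sketch below implements fixed-size chunking with overlap in plain Python. The window and overlap sizes are arbitrary assumptions, and production splitters typically count tokens rather than characters; the overlap exists so that a sentence straddling a boundary remains retrievable from at least one chunk.

<code python>
# Fixed-size chunking with overlap (character-based for simplicity;
# production splitters typically count tokens instead).
def chunk_fixed(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap  # step forward, keeping `overlap` chars of context
    return chunks

doc = "RAG retrieves external documents at query time. " * 40
chunks = chunk_fixed(doc, size=200, overlap=40)
print(len(chunks), len(chunks[0]))  # 13 chunks of up to 200 characters
</code>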
===== Example: Advanced RAG Pipeline =====

<code python>
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_cohere import CohereRerank
from langchain.retrievers import ContextualCompressionRetriever

# 1. Chunk documents with recursive splitting
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ". ", " "]
)
chunks = splitter.split_documents(documents)  # `documents` loaded earlier

# 2. Embed and store in a vector database
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vectorstore = Chroma.from_documents(chunks, embeddings)

# 3. Vector retrieval with reranking: fetch 20 candidates by similarity,
#    then keep the 5 best after cross-encoder reranking
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})
reranker = CohereRerank(top_n=5)
retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=base_retriever
)

# 4. Generate with retrieved context
llm = ChatOpenAI(model="gpt-4")
docs = retriever.invoke("How does GraphRAG improve retrieval?")
context = "\n".join(doc.page_content for doc in docs)
response = llm.invoke(f"Context: {context}\n\nQuestion: How does GraphRAG improve retrieval?")
</code>

===== Evaluation with RAGAS =====

[[https://docs.ragas.io/|RAGAS]](([[https://docs.ragas.io/|RAGAS Evaluation Framework]])) (Retrieval Augmented Generation Assessment) provides standard metrics for evaluating RAG pipelines:

  * **Faithfulness** — Are generated claims supported by retrieved context? Measured as $\frac{|\text{supported claims}|}{|\text{total claims}|}$
  * **Answer relevance** — Does the response address the actual question?
  * **Context precision** — How much of the retrieved context is relevant? $\text{Precision@}k = \frac{|\text{relevant chunks in top-}k|}{k}$ (e.g., 4 relevant chunks in the top 5 gives a precision of $0.8$)
  * **Context recall** — Were all necessary documents retrieved? $\text{Recall} = \frac{|\text{relevant chunks retrieved}|}{|\text{total relevant chunks}|}$

===== See Also =====

  * [[rag_in_ai|Retrieval-Augmented Generation (RAG) in AI]]
  * [[how_to_build_a_rag_pipeline|How to Build a RAG Pipeline]]
  * [[rag_system_production_deployment|RAG System Production Deployment]]
  * [[retrieval_strategies|Retrieval Strategies]]
  * [[rag_phases|Phases of a RAG System]]

===== References =====