====== Phases of a RAG System ======

Retrieval-Augmented Generation (RAG) is a system design pattern that enhances large language models by connecting them to external knowledge bases at query time. ((source [[https://www.ibm.com/think/topics/retrieval-augmented-generation|IBM - What is RAG]])) A RAG pipeline operates through three interconnected phases: **indexing**, **retrieval**, and **generation**. Each phase addresses a distinct part of the workflow, transforming raw data into accurate, grounded AI responses. ((source [[https://aws.amazon.com/what-is/retrieval-augmented-generation/|AWS - What is RAG]]))

===== Indexing Phase =====

The indexing phase occurs **offline**, before any user query is processed. Its purpose is to prepare external data sources for rapid, semantically meaningful retrieval. ((source [[https://developer.nvidia.com/blog/rag-101-demystifying-retrieval-augmented-generation-pipelines/|NVIDIA - RAG 101]]))

==== Document Preparation and Chunking ====

Raw documents such as PDFs, web pages, database records, and internal knowledge bases are loaded and split into smaller, manageable pieces called **chunks**. Chunking ensures that retrieval can target specific, relevant passages rather than entire documents. ((source [[https://www.databricks.com/blog/what-is-retrieval-augmented-generation|Databricks - What is RAG]])) Common strategies include fixed-size splitting (typically 500-2000 tokens), recursive splitting by document structure, and semantic chunking that preserves meaning boundaries. ((source [[https://mbrenndoerfer.com/writing/document-chunking-rag-strategies-retrieval|Brenndoerfer - Document Chunking for RAG]]))

==== Embedding Generation ====

Each chunk is converted into a **dense vector embedding** -- a fixed-length numerical array (commonly 768-1536 dimensions) that captures the semantic meaning of the text.
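As an illustration of the fixed-size strategy described under Document Preparation and Chunking above, here is a minimal word-based chunker with overlap. The 200-word chunk size and 50-word overlap are arbitrary assumptions for the sketch; production pipelines typically count tokens from the embedding model's tokenizer rather than words.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into fixed-size, overlapping chunks.

    Sizes are counted in words for simplicity; real pipelines
    usually count tokens from the embedding model's tokenizer.
    """
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last chunk reached the end of the document
    return chunks
```

Overlap keeps a sentence that straddles a chunk boundary present in at least one chunk, at the cost of some duplicated storage.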
((source [[https://en.wikipedia.org/wiki/Retrieval-augmented_generation|Wikipedia - Retrieval-augmented generation]])) Pre-trained transformer models such as BERT, Sentence-BERT, or OpenAI's text-embedding-ada-002 perform this conversion. These embeddings encode contextual relationships so that semantically similar content maps to nearby points in vector space. ((source [[https://www.weka.io/learn/guide/ai-ml/retrieval-augmented-generation/|WEKA - RAG Guide]]))

==== Vector Database Storage ====

The resulting embeddings, along with the original text chunks and associated metadata (source file, page number, timestamps), are stored in a **vector database** such as FAISS, Pinecone, Milvus, or Qdrant. ((source [[https://qdrant.tech/articles/what-is-rag-in-ai/|Qdrant - What is RAG in AI]])) These databases use indexing algorithms like HNSW (Hierarchical Navigable Small World) or IVF (Inverted File) to enable fast approximate nearest neighbor (ANN) searches across millions of vectors. ((source [[https://medium.com/@tararoutray/the-architecture-behind-vector-databases-in-modern-ai-systems-17a6c8a19095|Routray - Vector Database Architecture]])) Because indexing is an upfront cost paid once per document rather than once per query, RAG scales cost-effectively: data sources can be updated by re-indexing the affected documents, without retraining the underlying LLM. ((source [[https://www.databricks.com/blog/what-is-retrieval-augmented-generation|Databricks - What is RAG]]))

===== Retrieval Phase =====

When a user submits a query, the retrieval phase activates to find the most relevant information from the indexed knowledge base. ((source [[https://aws.amazon.com/what-is/retrieval-augmented-generation/|AWS - What is RAG]]))

==== Query Embedding ====

The user query is converted into a vector representation using the **same embedding model** that was applied during indexing, ensuring both queries and documents exist in the same semantic space.
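To make the shared-embedding-space point concrete, the sketch below embeds documents and a query with the same function and ranks documents by cosine similarity. The hash-based `toy_embed` is a purely illustrative stand-in for a trained encoder, and the exhaustive loop stands in for an ANN index such as HNSW; none of this is production retrieval code.

```python
import math

def toy_embed(text, dim=1024):
    """Illustrative stand-in for a trained embedding model:
    hashes each word into one slot of a fixed-length vector."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    return vec

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, corpus, k=2):
    """Brute-force top-K search; a vector database replaces this
    full scan with an ANN index such as HNSW or IVF."""
    q = toy_embed(query)  # same "model" for query and documents
    return sorted(corpus,
                  key=lambda doc: cosine(q, toy_embed(doc)),
                  reverse=True)[:k]
```

If query and documents were embedded with different models, their vectors would live in unrelated spaces and the similarity scores would be meaningless, which is why the same model must be used on both sides.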
((source [[https://www.aimon.ai/posts/rag_and_its_different_components/|AIMon - RAG Components]]))

==== Similarity Search ====

The query vector is compared against stored document vectors using distance metrics such as **cosine similarity** or dot product. ANN algorithms retrieve the top-K most semantically relevant chunks efficiently, even across billions of vectors. ((source [[https://www.aimon.ai/posts/rag_and_its_different_components/|AIMon - RAG Components]]))

==== Re-ranking and Filtering ====

Initial retrieval results may be refined through **re-ranking**, in which cross-encoder models score each query-chunk pair for a deeper relevance assessment. Metadata filters (date ranges, source authority, document type) can narrow the results further to the most precise context. ((source [[https://medium.com/@ashutoshkumars1ngh/hybrid-search-done-right-fixing-rag-retrieval-failures-using-bm25-hnsw-reciprocal-rank-fusion-a73596652d22|Ashutoshkumarsingh - Hybrid Search Done Right]]))

==== Hybrid Retrieval ====

Advanced RAG systems combine **dense retrieval** (semantic vector similarity) with **sparse retrieval** (keyword-based methods such as BM25 or TF-IDF). Scores from both approaches are fused using techniques such as Reciprocal Rank Fusion (RRF), leveraging semantic understanding for meaning and lexical precision for exact term matching. ((source [[https://insights.appetenza.com/hybrid-search-and-re-ranking-how-to-dramatically-improve-rag-retrieval-accuracy|Appetenza - Hybrid Search and Re-Ranking]]))

===== Generation Phase =====

The generation phase synthesizes retrieved information with the original query to produce a grounded, contextual response. ((source [[https://www.ibm.com/think/topics/retrieval-augmented-generation|IBM - What is RAG]]))

==== Prompt Augmentation ====

The original user prompt is enriched with the top-ranked retrieved chunks to form an **augmented prompt**.
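The Reciprocal Rank Fusion step from the Hybrid Retrieval section above can be sketched in a few lines; k=60 is the smoothing constant conventionally used, and the two example result lists are made up for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked result lists into one.

    Each document receives 1 / (k + rank) from every list it
    appears in; k=60 is the conventional smoothing constant.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Dense (vector) and sparse (BM25) retrievers disagree on ordering:
dense  = ["doc_a", "doc_b", "doc_c"]
sparse = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([dense, sparse])
# fused == ["doc_b", "doc_a", "doc_d", "doc_c"]
```

Here doc_b wins because it ranks highly in both lists, even though only the sparse retriever placed it first; RRF rewards cross-retriever agreement without needing the raw scores to be comparable.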
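A minimal sketch of how such an augmented prompt can be assembled is shown below; the instruction wording and the bracketed numbering scheme are illustrative assumptions, not a fixed standard.

```python
def build_augmented_prompt(question, chunks):
    """Combine retrieved chunks and the user question into one prompt.

    The instruction text here is an illustrative choice; the exact
    grounding phrasing varies between systems.
    """
    context = "\n\n".join(f"[{i}] {chunk}"
                          for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

Numbering the chunks gives the model stable identifiers to cite, which also makes the answer easier to verify against its sources later.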
A typical template structures the input as context passages followed by the user question, with explicit instructions to answer only from the provided context. ((source [[https://www.databricks.com/blog/what-is-retrieval-augmented-generation|Databricks - What is RAG]])) Techniques like few-shot prompting, context compression, and attention masking further optimize how the LLM processes the augmented input. ((source [[https://www.intersystems.com/resources/retrieval-augmented-generation/|InterSystems - RAG]]))

==== LLM Synthesis ====

The augmented prompt is submitted to the large language model, which uses both its training data and the retrieved external knowledge to generate a response. ((source [[https://blogs.nvidia.com/blog/what-is-retrieval-augmented-generation/|NVIDIA - What is RAG]])) The LLM synthesizes information from multiple retrieved passages into a coherent, natural-language answer, optionally citing sources for transparency and verifiability. ((source [[https://www.k2view.com/what-is-retrieval-augmented-generation|K2View - What is RAG]]))

==== Post-Processing ====

The generated response may undergo optional post-processing steps, including factual verification against retrieved sources, formatting adjustments, PII detection, and hallucination checks, before delivery to the user. ((source [[https://www.weka.io/learn/guide/ai-ml/retrieval-augmented-generation/|WEKA - RAG Guide]]))

===== See Also =====

  * [[retrieval_augmented_generation|Retrieval-Augmented Generation]]
  * [[how_to_build_a_rag_pipeline|How to Build a RAG Pipeline]]
  * [[agentic_rag|Agentic RAG]]
  * [[vector_db_comparison|Vector Database Comparison]]
  * [[rag_ingestion_phase|What Happens During the Ingestion Phase of RAG]]
  * [[rag_retrieval_phase|How Does the Retrieval Phase Work in RAG]]
  * [[ai_rag_how_it_works|How Does AI RAG Work]]

===== References =====