Retrieval-Augmented Generation (RAG) is a system design pattern that enhances large language models by connecting them to external knowledge bases at query time. A RAG pipeline operates through three interconnected phases: indexing, retrieval, and generation. Each phase addresses a distinct part of the workflow, transforming raw data into accurate, grounded AI responses.
The indexing phase occurs offline, before any user query is processed. Its purpose is to prepare external data sources for rapid, semantically meaningful retrieval.
Raw documents such as PDFs, web pages, database records, and internal knowledge bases are loaded and split into smaller, manageable pieces called chunks. Chunking ensures that retrieval can target specific, relevant passages rather than entire documents. Common strategies include fixed-size splitting (typically 500-2000 tokens), recursive splitting by document structure, and semantic chunking that preserves meaning boundaries.
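The simplest of these strategies, fixed-size splitting with overlap, can be sketched in a few lines. This is an illustrative sketch only: it splits on whitespace as a rough stand-in for real tokenization, and the chunk size and overlap values are arbitrary, not recommendations.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping fixed-size chunks.

    Uses whitespace-separated words as a crude token proxy; production
    systems would use a real tokenizer instead.
    """
    words = text.split()
    chunks = []
    step = chunk_size - overlap  # how far the window advances each time
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

The overlap means the tail of one chunk is repeated at the head of the next, so a passage that straddles a chunk boundary still appears intact in at least one chunk.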
Each chunk is converted into a dense vector embedding – a fixed-length numerical array (commonly 768-1536 dimensions) that captures the semantic meaning of the text. Pre-trained transformer models such as BERT, Sentence-BERT, or OpenAI text-embedding-ada-002 perform this conversion. These embeddings encode contextual relationships so that semantically similar content maps to nearby points in vector space.
The resulting embeddings, along with the original text chunks and associated metadata (source file, page number, timestamps), are stored in a vector database such as FAISS, Pinecone, Milvus, or Qdrant. These databases use indexing algorithms like HNSW (Hierarchical Navigable Small World) or IVF (Inverted File) to enable fast approximate nearest neighbor (ANN) searches across millions of vectors.
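Conceptually, each stored entry bundles the embedding, the original text, and its metadata together. The record below is a hypothetical illustration of that shape; the field names are assumptions, not any particular database's schema.

```python
# One indexed chunk as a vector store might conceptually hold it.
record = {
    "id": "doc42-chunk3",
    "embedding": [0.12, -0.04, 0.88],  # truncated: real vectors have 768+ dims
    "text": "RAG connects LLMs to external knowledge at query time.",
    "metadata": {
        "source": "handbook.pdf",   # where the chunk came from
        "page": 7,
        "indexed_at": "2024-01-15", # supports freshness filtering later
    },
}
```

Keeping the raw text and metadata alongside the vector is what lets the retrieval phase return human-readable passages and apply metadata filters, rather than just vector IDs.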
Because indexing is an upfront, offline cost, RAG is a scalable and cost-effective solution – data sources can be updated by re-indexing new or changed documents, without retraining the underlying LLM.
When a user submits a query, the retrieval phase activates to find the most relevant information from the indexed knowledge base.
The user query is converted into a vector representation using the same embedding model that was applied during indexing, ensuring both queries and documents exist in the same semantic space.
The query vector is compared against stored document vectors using distance metrics such as cosine similarity or dot product. ANN algorithms retrieve the top-K most semantically relevant chunks efficiently, even across billions of vectors.
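The underlying operation that ANN indexes approximate is exact nearest-neighbor search by similarity score. A minimal brute-force version, fine for small collections, looks like this (the `(chunk_id, vector)` index layout is an illustrative assumption):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k=2):
    """index: list of (chunk_id, vector) pairs.
    Returns the ids of the k chunks most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), cid) for cid, vec in index]
    scored.sort(reverse=True)  # highest similarity first
    return [cid for _, cid in scored[:k]]
```

Scanning every vector this way is O(N) per query; HNSW and IVF trade a small amount of recall for dramatically lower query cost at scale.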
Initial retrieval results may be further refined through re-ranking using cross-encoder models that score query-chunk pairs for deeper relevance assessment. Metadata filters (date ranges, source authority, document type) can further narrow results to the most precise context.
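A metadata filter of this kind is straightforward to sketch. The result shape and field names below are hypothetical, carried over from the illustrative record layout used during indexing:

```python
def filter_by_metadata(results, source=None, after=None):
    """Keep only retrieved chunks matching the given metadata constraints.

    results: list of dicts, each with 'text' and 'meta' ({'source', 'date'}).
    source:  if set, require an exact source match.
    after:   if set, require meta['date'] >= after (ISO date strings compare
             correctly as plain strings).
    """
    kept = []
    for r in results:
        if source is not None and r["meta"]["source"] != source:
            continue
        if after is not None and r["meta"]["date"] < after:
            continue
        kept.append(r)
    return kept
```

In practice such filters are usually pushed down into the vector database query itself rather than applied after retrieval, so the top-K slots are not wasted on chunks that would be discarded.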
Advanced RAG systems combine dense retrieval (semantic vector similarity) with sparse retrieval (keyword-based methods like BM25 or TF-IDF). Scores from both approaches are fused using techniques such as Reciprocal Rank Fusion (RRF), leveraging semantic understanding for meaning and lexical precision for exact term matching.
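RRF itself is simple: each document earns 1/(k + rank) from every ranked list it appears in, and the sums are sorted. A minimal sketch, with hypothetical document ids and the conventional smoothing constant k = 60:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one.

    Each document scores sum(1 / (k + rank)) over the lists it appears in;
    ranks start at 1, and k dampens the influence of top positions.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d2"]    # e.g. ranking from vector similarity
sparse = ["d1", "d4", "d3"]   # e.g. ranking from BM25
fused = reciprocal_rank_fusion([dense, sparse])
```

Because RRF uses only rank positions, it needs no score normalization, which is exactly why it works well for fusing dense and sparse retrievers whose raw scores live on incomparable scales.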
The generation phase synthesizes retrieved information with the original query to produce a grounded, contextual response.
The original user prompt is enriched with the top-ranked retrieved chunks to form an augmented prompt. A typical template structures the input as context passages followed by the user question with explicit instructions to answer only from the provided context. Techniques like few-shot prompting, context compression, and attention masking further optimize how the LLM processes the augmented input.
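One possible shape for such a template is sketched below. The exact wording, delimiters, and numbering scheme are assumptions; real systems tune these instructions per model.

```python
def build_augmented_prompt(question, chunks):
    """Assemble a context-grounded prompt from retrieved chunks.

    Numbering the passages ([1], [2], ...) gives the model stable labels
    it can use when citing sources in its answer.
    """
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using ONLY the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

The explicit "only from the context" instruction and the don't-know escape hatch are what push the model toward grounded answers instead of falling back on its parametric knowledge.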
The augmented prompt is submitted to the large language model, which uses both its training data and the retrieved external knowledge to generate a response. The LLM synthesizes information from multiple retrieved passages into a coherent, natural-language answer, optionally citing sources for transparency and verifiability.
The generated response may undergo optional post-processing steps including factual verification against retrieved sources, formatting adjustments, PII detection, and hallucination checks before delivery to the user.