Retrieval-Augmented Generation (RAG) is a system design pattern that enhances large language models by connecting them to external knowledge bases at query time. A RAG pipeline operates through three interconnected phases: indexing, retrieval, and generation. Each phase addresses a distinct part of the workflow, transforming raw data into accurate, grounded AI responses.
The indexing phase occurs offline, before any user query is processed. Its purpose is to prepare external data sources for rapid, semantically meaningful retrieval.
Raw documents such as PDFs, web pages, database records, and internal knowledge bases are loaded and split into smaller, manageable pieces called chunks. Chunking ensures that retrieval can target specific, relevant passages rather than entire documents. Common strategies include fixed-size splitting (typically 500-2000 tokens), recursive splitting by document structure, and semantic chunking that preserves meaning boundaries.
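The simplest of these strategies, fixed-size splitting with overlap, can be sketched in a few lines. This is an illustrative sketch only: it splits on whitespace as a rough stand-in for real tokenization, and the chunk size and overlap values are arbitrary, not recommendations.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping fixed-size chunks.

    Uses whitespace-separated words as a crude token proxy; production
    systems would use a real tokenizer instead.
    """
    words = text.split()
    chunks = []
    step = chunk_size - overlap  # how far the window advances each time
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

The overlap means the tail of one chunk is repeated at the head of the next, so a passage that straddles a chunk boundary still appears intact in at least one chunk.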
Each chunk is converted into a dense vector embedding – a fixed-length numerical array (commonly 768-1536 dimensions) that captures the semantic meaning of the text. Pre-trained transformer models such as BERT, Sentence-BERT, or OpenAI text-embedding-ada-002 perform this conversion. These embeddings encode contextual relationships so that semantically similar content maps to nearby points in vector space.
The resulting embeddings, along with the original text chunks and associated metadata (source file, page number, timestamps), are stored in a vector database such as FAISS, Pinecone, Milvus, or Qdrant. These databases use indexing algorithms like HNSW (Hierarchical Navigable Small World) or IVF (Inverted File) to enable fast approximate nearest neighbor (ANN) searches across millions of vectors.
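Conceptually, each stored entry bundles the embedding, the original text, and its metadata together. The record below is a hypothetical illustration of that shape; the field names are assumptions, not any particular database's schema.

```python
# One indexed chunk as a vector store might conceptually hold it.
record = {
    "id": "doc42-chunk3",
    "embedding": [0.12, -0.04, 0.88],  # truncated: real vectors have 768+ dims
    "text": "RAG connects LLMs to external knowledge at query time.",
    "metadata": {
        "source": "handbook.pdf",   # where the chunk came from
        "page": 7,
        "indexed_at": "2024-01-15", # supports freshness filtering later
    },
}
```

Keeping the raw text and metadata alongside the vector is what lets the retrieval phase return human-readable passages and apply metadata filters, rather than just vector IDs.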
Because indexing is an upfront, offline cost, RAG is a scalable and cost-effective solution – data sources can be updated by re-indexing new or changed documents, without retraining the underlying LLM.
When a user submits a query, the retrieval phase activates to find the most relevant information from the indexed knowledge base.
The user query is converted into a vector representation using the same embedding model that was applied during indexing, ensuring both queries and documents exist in the same semantic space.
The query vector is compared against stored document vectors using distance metrics such as cosine similarity or dot product. ANN algorithms retrieve the top-K most semantically relevant chunks efficiently, even across billions of vectors.
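The underlying operation that ANN indexes approximate is exact nearest-neighbor search by similarity score. A minimal brute-force version, fine for small collections, looks like this (the `(chunk_id, vector)` index layout is an illustrative assumption):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k=2):
    """index: list of (chunk_id, vector) pairs.
    Returns the ids of the k chunks most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), cid) for cid, vec in index]
    scored.sort(reverse=True)  # highest similarity first
    return [cid for _, cid in scored[:k]]
```

Scanning every vector this way is O(N) per query; HNSW and IVF trade a small amount of recall for dramatically lower query cost at scale.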
Initial retrieval results may be further refined through re-ranking using cross-encoder models that score query-chunk pairs for deeper relevance assessment. Metadata filters (date ranges, source authority, document type) can further narrow results to the most precise context.
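A metadata filter of this kind is straightforward to sketch. The result shape and field names below are hypothetical, carried over from the illustrative record layout used during indexing:

```python
def filter_by_metadata(results, source=None, after=None):
    """Keep only retrieved chunks matching the given metadata constraints.

    results: list of dicts, each with 'text' and 'meta' ({'source', 'date'}).
    source:  if set, require an exact source match.
    after:   if set, require meta['date'] >= after (ISO date strings compare
             correctly as plain strings).
    """
    kept = []
    for r in results:
        if source is not None and r["meta"]["source"] != source:
            continue
        if after is not None and r["meta"]["date"] < after:
            continue
        kept.append(r)
    return kept
```

In practice such filters are usually pushed down into the vector database query itself rather than applied after retrieval, so the top-K slots are not wasted on chunks that would be discarded.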
Advanced RAG systems combine dense retrieval (semantic vector similarity) with sparse retrieval (keyword-based methods like BM25 or TF-IDF). Scores from both approaches are fused using techniques such as Reciprocal Rank Fusion (RRF), leveraging semantic understanding for meaning and lexical precision for exact term matching.
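RRF itself is simple: each document earns 1/(k + rank) from every ranked list it appears in, and the sums are sorted. A minimal sketch, with hypothetical document ids and the conventional smoothing constant k = 60:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one.

    Each document scores sum(1 / (k + rank)) over the lists it appears in;
    ranks start at 1, and k dampens the influence of top positions.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d2"]    # e.g. ranking from vector similarity
sparse = ["d1", "d4", "d3"]   # e.g. ranking from BM25
fused = reciprocal_rank_fusion([dense, sparse])
```

Because RRF uses only rank positions, it needs no score normalization, which is exactly why it works well for fusing dense and sparse retrievers whose raw scores live on incomparable scales.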
The generation phase synthesizes retrieved information with the original query to produce a grounded, contextual response.
The original user prompt is enriched with the top-ranked retrieved chunks to form an augmented prompt. A typical template structures the input as context passages followed by the user question with explicit instructions to answer only from the provided context. Techniques like few-shot prompting, context compression, and attention masking further optimize how the LLM processes the augmented input.
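One possible shape for such a template is sketched below. The exact wording, delimiters, and numbering scheme are assumptions; real systems tune these instructions per model.

```python
def build_augmented_prompt(question, chunks):
    """Assemble a context-grounded prompt from retrieved chunks.

    Numbering the passages ([1], [2], ...) gives the model stable labels
    it can use when citing sources in its answer.
    """
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using ONLY the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

The explicit "only from the context" instruction and the don't-know escape hatch are what push the model toward grounded answers instead of falling back on its parametric knowledge.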
The augmented prompt is submitted to the large language model, which uses both its training data and the retrieved external knowledge to generate a response. The LLM synthesizes information from multiple retrieved passages into a coherent, natural-language answer, optionally citing sources for transparency and verifiability.
The generated response may undergo optional post-processing steps including factual verification against retrieved sources, formatting adjustments, PII detection, and hallucination checks before delivery to the user.