====== How Does AI RAG Work ======

Retrieval-Augmented Generation (RAG) is a system design pattern that enhances large language models by retrieving relevant external data at query time and incorporating it into the generation process. Rather than relying solely on information memorized during training, RAG grounds LLM responses in actual documents, reducing hallucinations and enabling accurate answers about proprietary, domain-specific, or recently updated information. ((source [[https://www.ibm.com/think/topics/retrieval-augmented-generation|IBM - What is RAG]]))

===== Why RAG Exists =====

LLMs are trained on large but finite datasets with a fixed knowledge cutoff date. They cannot access private organizational data, and they sometimes generate confident but fabricated answers (hallucinations). ((source [[https://www.databricks.com/blog/what-is-retrieval-augmented-generation|Databricks - What is RAG]])) Fine-tuning a model on new data is expensive, inflexible, and requires retraining for every update. RAG solves these problems by decoupling the knowledge source from the model itself: documents can be updated, added, or removed without touching the LLM. ((source [[https://glyphsignal.com/guides/rag-guide|GlyphSignal - RAG Guide 2026]]))

===== Step 1: Document Preparation and Chunking =====

The RAG pipeline begins with preparing the source documents that will serve as the knowledge base. Raw documents (PDFs, Word files, web pages, database records, Confluence pages, Slack messages) are ingested and split into smaller **chunks**, typically 300-1000 tokens each. ((source [[https://www.databricks.com/blog/what-is-retrieval-augmented-generation|Databricks - What is RAG]])) Chunking is necessary because embedding an entire document as a single vector dilutes important details. Smaller chunks allow the retrieval system to pinpoint specific, relevant passages rather than returning entire documents.
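As an illustration, a recursive splitter can be sketched in a few lines of Python. The separator hierarchy, size limit, and greedy packing below are illustrative choices, not any particular library's implementation:

```python
def recursive_chunk(text, max_len=200):
    """Split text into chunks of at most max_len characters,
    preferring paragraph, then sentence, then word boundaries."""
    if len(text) <= max_len:
        return [text]
    for sep in ("\n\n", ". ", " "):
        parts = text.split(sep)
        if len(parts) > 1:
            chunks, current = [], ""
            for part in parts:
                candidate = part if not current else current + sep + part
                if len(candidate) <= max_len:
                    current = candidate
                else:
                    if current:
                        chunks.append(current)
                    current = part
            if current:
                chunks.append(current)
            # Re-split any chunk that is still too long using finer separators
            return [c for chunk in chunks for c in recursive_chunk(chunk, max_len)]
    # No separator applies: fall back to a hard character split
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]
```

Production systems typically measure chunk size in tokens rather than characters, using the embedding model's own tokenizer.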
((source [[https://mbrenndoerfer.com/writing/document-chunking-rag-strategies-retrieval|Brenndoerfer - Document Chunking for RAG]]))

Common chunking strategies include:

  * **Recursive splitting**: hierarchically divide by paragraph, sentence, then character boundaries
  * **Semantic chunking**: group segments by meaning or topic using embedding similarity between them
  * **Structure-based splitting**: use document headings, sections, and formatting as natural break points

Preprocessing may also include cleaning (removing noise and formatting artifacts), metadata extraction (timestamps, authors, source identifiers), and normalization. ((source [[https://www.intersystems.com/resources/retrieval-augmented-generation/|InterSystems - RAG]]))

===== Step 2: Embedding Generation =====

Each chunk is converted into a **dense vector embedding**: a fixed-length numerical array that captures the semantic meaning of the text. ((source [[https://www.intersystems.com/resources/retrieval-augmented-generation/|InterSystems - RAG]])) Embedding models such as OpenAI's text-embedding-3-small (1536 dimensions), Sentence-BERT, Cohere Embed, or E5 use transformer architectures to encode text into high-dimensional vectors. These vectors place semantically similar content near each other in vector space: "automobile maintenance" and "car repairs" will have nearby vector representations despite sharing no words. ((source [[https://medium.com/@iamanraghuvanshi/vector-embeddings-and-vector-databases-0cd0e2a8d95b|Raghuvanshi - Vector Embeddings and Vector Databases]])) The mathematical core is the transformer's attention mechanism, which weighs the contextual relationships between every token in the input to produce a single embedding that represents the meaning of the entire passage.
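The mechanics of fixed-length vectors compared in a shared space can be sketched with a deliberately naive word-count "embedding". This toy cannot capture synonymy the way a transformer model does (it only sees exact word overlap), but it shows the shape of the data a real embedding model produces:

```python
import math

def build_vocab(texts):
    # A real system uses a learned embedding model; here we just
    # enumerate the corpus vocabulary to get fixed dimensions.
    return sorted({w for t in texts for w in t.lower().split()})

def embed(text, vocab):
    # Fixed-length vector: one dimension per vocabulary word, L2-normalized.
    words = text.lower().split()
    vec = [float(words.count(w)) for w in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    # Vectors are unit-length, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

docs = ["car repair guide", "quarterly revenue report"]
vocab = build_vocab(docs)
query = embed("car repair manual", vocab)
# The query lands nearer the overlapping document than the unrelated one.
```

A real embedding model would also place "automobile maintenance" near "car repairs"; this sketch only demonstrates the vector-space comparison itself.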
((source [[https://www.intersystems.com/resources/retrieval-augmented-generation/|InterSystems - RAG]]))

===== Step 3: Vector Storage and Indexing =====

Embeddings are stored in a **vector database** (Pinecone, Milvus, Qdrant, pgvector, ChromaDB, FAISS) alongside the original text chunks and metadata. ((source [[https://qdrant.tech/articles/what-is-rag-in-ai/|Qdrant - What is RAG in AI]])) The database builds **indexes** for fast approximate nearest neighbor (ANN) search:

  * **HNSW**: a multi-layer graph structure enabling fast, high-recall navigation to nearest neighbors
  * **IVF**: partition-based indexing that clusters vectors and searches only the relevant partitions

These indexes enable sub-second retrieval across millions of vectors, making RAG practical for production workloads. ((source [[https://medium.com/@tararoutray/the-architecture-behind-vector-databases-in-modern-ai-systems-17a6c8a19095|Routray - Vector Database Architecture]]))

===== Step 4: Query Processing =====

When a user submits a question, it is embedded into a vector using the **same embedding model** used during ingestion. This ensures the query and the documents occupy the same semantic space, so comparing them is meaningful. ((source [[https://aws.amazon.com/what-is/retrieval-augmented-generation/|AWS - What is RAG]])) Advanced systems may apply **query expansion** or rewriting before embedding: an LLM can rephrase ambiguous queries, add synonyms, or decompose complex multi-part questions into sub-queries to improve retrieval coverage. ((source [[https://www.meilisearch.com/blog/rag-types|Meilisearch - RAG Types]]))

===== Step 5: Retrieval =====

The query vector is compared against all stored document vectors using **cosine similarity**:

<code>
cos(q, c) = (q · c) / (||q|| * ||c||)
</code>

This measures the directional alignment between vectors, identifying chunks whose semantic meaning is closest to the query regardless of document length.
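Applied directly, the formula yields a brute-force (exact) version of retrieval. This is a sketch of the math only; production systems replace the exhaustive scan with an ANN index such as HNSW or IVF:

```python
import math

def cosine(q, c):
    # cos(q, c) = (q · c) / (||q|| * ||c||)
    dot = sum(x * y for x, y in zip(q, c))
    nq = math.sqrt(sum(x * x for x in q))
    nc = math.sqrt(sum(x * x for x in c))
    return dot / (nq * nc) if nq and nc else 0.0

def top_k(query_vec, chunk_vecs, k=3):
    # Exact nearest-neighbor search: score every stored vector,
    # then keep the k highest-scoring (index, similarity) pairs.
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

query = [1.0, 0.0]
stored = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = top_k(query, stored, k=2)
```

Note that cosine similarity ignores vector magnitude, which is why a short chunk and a long chunk with the same meaning score alike.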
((source [[https://www.aimon.ai/posts/rag_and_its_different_components/|AIMon - RAG Components]]))

The search returns the **top-K** (typically 3-10) most similar chunks. These candidates may then undergo **re-ranking** using cross-encoder models, which jointly process the query and each chunk for more accurate relevance scoring. ((source [[https://insights.appetenza.com/hybrid-search-and-re-ranking-how-to-dramatically-improve-rag-retrieval-accuracy|Appetenza - Hybrid Search and Re-Ranking]])) **Hybrid retrieval** combines dense vector search with sparse keyword search (BM25) using fusion techniques such as Reciprocal Rank Fusion (RRF), ensuring both semantic understanding and lexical precision. ((source [[https://medium.com/@ashutoshkumars1ngh/hybrid-search-done-right-fixing-rag-retrieval-failures-using-bm25-hnsw-reciprocal-rank-fusion-a73596652d22|Ashutoshkumarsingh - Hybrid Search Done Right]]))

===== Step 6: Prompt Augmentation =====

The top retrieved chunks are assembled into an **augmented prompt** that structures the context for the LLM:

<code>
Context:
[retrieved_chunk_1]
[retrieved_chunk_2]
[retrieved_chunk_3]

Question: [user_query]

Answer using only the context above.
</code>

The prompt template instructs the LLM to answer exclusively from the provided context, which reduces hallucination. ((source [[https://www.databricks.com/blog/what-is-retrieval-augmented-generation|Databricks - What is RAG]])) Additional techniques include few-shot examples, context compression (summarizing chunks to fit within token limits), and attention masking to prioritize the most relevant passages. ((source [[https://www.intersystems.com/resources/retrieval-augmented-generation/|InterSystems - RAG]])) For long contexts, hierarchical summarization or truncation ensures the augmented prompt fits within the LLM's context window (8K-128K tokens depending on the model).
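Assembling the augmented prompt can be sketched as a small template function. The character budget below is a stand-in for real token counting, which would use the target model's tokenizer:

```python
def build_prompt(query, chunks, max_context_chars=4000):
    # Pack retrieved chunks into the template, in retrieval order,
    # until the context budget is exhausted.
    parts, used = [], 0
    for i, chunk in enumerate(chunks, start=1):
        if used + len(chunk) > max_context_chars:
            break
        parts.append(f"[{i}] {chunk}")
        used += len(chunk)
    context = "\n\n".join(parts)
    return (
        "Context:\n"
        f"{context}\n\n"
        f"Question: {query}\n"
        "Answer using only the context above."
    )
```

Numbering the chunks, as done here, also makes it easy to prompt the model to cite which passage supports each claim.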
((source [[https://www.intersystems.com/resources/retrieval-augmented-generation/|InterSystems - RAG]]))

===== Step 7: LLM Response Generation =====

The augmented prompt is submitted to the LLM (GPT-4, Claude, Llama, Mistral, or another model), which synthesizes information from the retrieved passages into a coherent, natural-language response. ((source [[https://blogs.nvidia.com/blog/what-is-retrieval-augmented-generation/|NVIDIA - What is RAG]])) The LLM combines its language-understanding capabilities with the retrieved evidence to:

  * Synthesize information from multiple passages into a unified answer
  * Maintain a coherent narrative structure
  * Cite sources when prompted, for transparency
  * Handle nuanced questions that require reasoning across multiple chunks

((source [[https://www.k2view.com/what-is-retrieval-augmented-generation|K2View - What is RAG]]))

Optional post-processing may include faithfulness checking (verifying claims against the retrieved sources), PII redaction, formatting adjustments, and response streaming for a better user experience. ((source [[https://www.weka.io/learn/guide/ai-ml/retrieval-augmented-generation/|WEKA - RAG Guide]]))

===== End-to-End Flow =====

The complete RAG process runs in milliseconds to seconds:

<code>
Offline (one-time):  Documents -> Chunk -> Embed -> Store
Runtime (per query): Query -> Embed -> Search -> Retrieve -> Augment Prompt -> Generate -> Response
</code>

This architecture scales with vector database capacity, supports real-time knowledge updates by embedding new documents without retraining the LLM, and can be extended with multimodal retrieval (images, tables), agentic reasoning loops, and graph-based knowledge structures.
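The offline and runtime phases can be tied together in a minimal in-memory skeleton. As before, a word-count toy stands in for a real embedding model, and the final generation step is left out because it requires an actual LLM call; the class stops at producing the augmented prompt:

```python
import math

class ToyRAG:
    """In-memory RAG skeleton: ingest -> embed -> retrieve -> build prompt.
    A word-count embedding stands in for a real model, and the LLM call
    itself is omitted; the class only produces the augmented prompt."""

    def __init__(self, docs):
        # Offline phase: embed every chunk once and keep the vectors.
        self.docs = docs
        self.vocab = sorted({w for d in docs for w in d.lower().split()})
        self.vectors = [self._embed(d) for d in docs]

    def _embed(self, text):
        words = text.lower().split()
        vec = [float(words.count(w)) for w in self.vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        return [v / norm for v in vec]

    def retrieve(self, query, k=2):
        # Runtime phase: embed the query with the same model, rank by cosine.
        q = self._embed(query)
        scores = [sum(a * b for a, b in zip(q, v)) for v in self.vectors]
        ranked = sorted(range(len(self.docs)), key=lambda i: scores[i], reverse=True)
        return [self.docs[i] for i in ranked[:k]]

    def build_prompt(self, query, k=2):
        context = "\n".join(self.retrieve(query, k))
        return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context above."
```

Swapping the toy embedding for a real model and the list scan for a vector database turns this skeleton into the production architecture described above.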
((source [[https://www.databricks.com/blog/what-is-retrieval-augmented-generation|Databricks - What is RAG]]))

===== See Also =====

  * [[retrieval_augmented_generation|Retrieval-Augmented Generation]]
  * [[how_to_build_a_rag_pipeline|How to Build a RAG Pipeline]]
  * [[agentic_rag|Agentic RAG]]
  * [[vector_db_comparison|Vector Database Comparison]]
  * [[rag_phases|Phases of a RAG System]]
  * [[rag_ingestion_phase|What Happens During the Ingestion Phase of RAG]]
  * [[rag_retrieval_phase|How Does the Retrieval Phase Work in RAG]]
  * [[vector_database_rag|Role of a Vector Database in AI RAG Architecture]]
  * [[rag_vs_search|How Does a RAG Chatbot Improve Upon Traditional Search]]

===== References =====