How Does AI RAG Work

Retrieval-Augmented Generation (RAG) is a system design pattern that enhances large language models by retrieving relevant external data at query time and incorporating it into the generation process. Rather than relying solely on information memorized during training, RAG grounds LLM responses in actual documents, reducing hallucinations and enabling accurate answers about proprietary, domain-specific, or recently updated information.

Why RAG Exists

LLMs are trained on large but finite datasets with a fixed knowledge cutoff date. They cannot access private organizational data, and they sometimes generate confident but fabricated answers (hallucinations). Fine-tuning a model on new data is expensive, inflexible, and requires retraining for every update. RAG solves these problems by decoupling the knowledge source from the model itself – documents can be updated, added, or removed without touching the LLM.

Step 1: Document Preparation and Chunking

The RAG pipeline begins with preparing the source documents that will serve as the knowledge base. Raw documents – PDFs, Word files, web pages, database records, Confluence pages, Slack messages – are ingested and split into smaller chunks of typically 300-1000 tokens each.

Chunking is necessary because embedding an entire document as one vector dilutes important details. Smaller chunks allow the retrieval system to pinpoint specific, relevant passages rather than returning entire documents.

Common chunking strategies include:

- Fixed-size chunking: splitting on a set token count, usually with overlap between adjacent chunks to preserve context at the boundaries
- Sentence or paragraph chunking: splitting on natural linguistic boundaries
- Recursive chunking: splitting on a hierarchy of separators (sections, paragraphs, sentences) until each piece fits the size limit
- Semantic chunking: grouping sentences by embedding similarity so each chunk covers a single topic

Preprocessing may also include cleaning (removing noise and formatting artifacts), metadata extraction (timestamps, authors, source identifiers), and normalization.
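Fixed-size chunking with overlap can be sketched in a few lines. This is a minimal illustration: real pipelines count model tokens with the LLM's tokenizer, whereas here tokens are approximated by whitespace-split words to keep the example self-contained.

```python
def chunk_text(text, chunk_size=200, overlap=40):
    """Split text into overlapping fixed-size chunks.

    Tokens are approximated by whitespace-split words; production
    systems would use a real tokenizer. Overlapping chunks preserve
    context that would otherwise be cut at chunk boundaries.
    """
    words = text.split()
    chunks = []
    step = chunk_size - overlap  # how far the window advances each time
    for start in range(0, len(words), step):
        chunk = words[start:start + chunk_size]
        if chunk:
            chunks.append(" ".join(chunk))
        if start + chunk_size >= len(words):
            break  # last window already covered the end of the text
    return chunks
```

The last 40 words of each chunk reappear as the first 40 words of the next, so a sentence straddling a boundary is still fully contained in at least one chunk.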

Step 2: Embedding Generation

Each chunk is converted into a dense vector embedding – a fixed-length numerical array that captures the semantic meaning of the text.

Embedding models like OpenAI text-embedding-3-small (1536 dimensions), Sentence-BERT, Cohere Embed, or E5 use transformer architectures to encode text into high-dimensional vectors. These vectors place semantically similar content near each other in vector space: “automobile maintenance” and “car repairs” will have nearby vector representations despite sharing no words.

The mathematical core is the transformer's attention mechanism, which weighs the contextual relationships between every token in the input to produce a single embedding that represents the meaning of the entire passage.
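The mechanics downstream of the embedding model (fixed-length vectors, normalization, cosine comparison) can be shown with a deliberately crude stand-in. The hashed bag-of-words function below is *not* a semantic embedding – it only captures word overlap, not meaning – but it produces L2-normalized fixed-length vectors of the same shape a real model would.

```python
import hashlib
import math

def toy_embed(text, dim=64):
    """Toy stand-in for a transformer embedding model.

    Builds a hashed bag-of-words vector and L2-normalizes it. Real
    models (Sentence-BERT, text-embedding-3-small, etc.) encode word
    order and context; this sketch only encodes word counts, but it
    yields fixed-length unit vectors suitable for cosine comparison.
    """
    vec = [0.0] * dim
    for word in text.lower().split():
        # hash each word to a stable bucket index
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]
```

Because the output is a unit vector, the dot product of two embeddings equals their cosine similarity, which is exactly what the retrieval step computes.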

Step 3: Vector Storage and Indexing

Embeddings are stored in a vector database (Pinecone, Milvus, Qdrant, pgvector, ChromaDB, FAISS) alongside the original text chunks and metadata.

The database builds indexes for fast approximate nearest neighbor (ANN) search:

- HNSW (Hierarchical Navigable Small World): a layered proximity graph traversed greedily from coarse upper layers down to fine lower layers
- IVF (Inverted File Index): vectors are clustered into partitions, and a query scans only the partitions nearest to it
- Product Quantization (PQ): vectors are compressed into compact codes, reducing memory footprint and speeding up distance computation

These indexes enable sub-second retrieval across millions of vectors, making RAG practical for production workloads.
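What a vector store does can be sketched with an exact brute-force version. The class below stores (vector, text, metadata) records and scans them all at query time; a production database replaces the linear scan with the ANN indexes described above, trading a little recall for orders-of-magnitude speed.

```python
class InMemoryVectorStore:
    """Minimal stand-in for a vector database.

    Stores (vector, text, metadata) records and performs exact
    brute-force cosine search. Real databases (Pinecone, Qdrant,
    FAISS, ...) replace the linear scan with ANN indexes such as
    HNSW or IVF to stay fast at millions of vectors.
    """

    def __init__(self):
        self.records = []

    def add(self, vector, text, metadata=None):
        self.records.append((vector, text, metadata or {}))

    def search(self, query_vec, top_k=3):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = sum(x * x for x in a) ** 0.5
            nb = sum(y * y for y in b) ** 0.5
            return dot / (na * nb) if na and nb else 0.0

        scored = [(cosine(query_vec, v), text, meta)
                  for v, text, meta in self.records]
        scored.sort(key=lambda t: t[0], reverse=True)
        return scored[:top_k]
```

Keeping the original text and metadata next to each vector matters: the retrieval step returns chunks of text for the prompt, not raw vectors.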

Step 4: Query Processing

When a user submits a question, it is embedded into a vector using the same embedding model used during ingestion. This ensures the query and documents exist in the same semantic space for meaningful comparison.

Advanced systems may apply query expansion or rewriting before embedding. An LLM can rephrase ambiguous queries, add synonyms, or decompose complex multi-part questions into sub-queries to improve retrieval coverage.

Step 5: Retrieval

The query vector is compared against all stored document vectors using cosine similarity:

cos(q, c) = (q . c) / (||q|| * ||c||)

This measures the directional alignment between vectors, identifying chunks whose semantic meaning is closest to the query regardless of document length.

The search returns the top-K (typically 3-10) most similar chunks. These candidates may then undergo re-ranking using cross-encoder models that jointly process the query and each chunk for more accurate relevance scoring.

Hybrid retrieval combines dense vector search with sparse keyword search (BM25) using fusion techniques like Reciprocal Rank Fusion (RRF), ensuring both semantic understanding and lexical precision.
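RRF itself is short enough to show in full. Each document scores the sum of 1/(k + rank) over every ranked list it appears in; k = 60 is the commonly used default, which dampens the influence of any single list's top positions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple ranked lists of document IDs with RRF.

    Each document scores sum(1 / (k + rank)) over every list it
    appears in, with rank counted from 1. Documents ranked well by
    several retrievers rise above documents ranked well by only one.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical example: dense search and BM25 partially disagree;
# RRF rewards d1, which both retrievers ranked highly.
dense = ["d2", "d1", "d3"]
bm25 = ["d1", "d3", "d4"]
fused = reciprocal_rank_fusion([dense, bm25])
```

Because RRF uses only rank positions, it needs no score normalization between the dense and sparse retrievers, which is why it is a popular fusion choice.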

Step 6: Prompt Augmentation

The top retrieved chunks are assembled into an augmented prompt that structures the context for the LLM:

Context: [retrieved_chunk_1]
[retrieved_chunk_2]
[retrieved_chunk_3]

Question: [user_query]
Answer using only the context above.

The prompt template instructs the LLM to answer exclusively from the provided context, which reduces hallucination. Additional techniques include few-shot examples, context compression (summarizing chunks to fit within token limits), and context re-ordering, which places the most relevant passages at the start or end of the prompt where models attend to them most reliably.

For long contexts, hierarchical summarization or truncation ensures the augmented prompt fits within the LLM's context window (8K-128K tokens depending on the model).
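Assembling the augmented prompt from the template above is straightforward; the sketch below adds a crude context budget, using word counts as a stand-in for real token counting with the LLM's tokenizer.

```python
def build_prompt(chunks, query, max_context_words=1000):
    """Assemble an augmented prompt from retrieved chunks.

    Chunks are assumed to arrive in relevance order and are added
    until the word budget is spent; real systems count model tokens
    with the LLM's tokenizer rather than words.
    """
    selected, used = [], 0
    for chunk in chunks:
        n = len(chunk.split())
        if used + n > max_context_words:
            break  # adding this chunk would overflow the budget
        selected.append(chunk)
        used += n
    context = "\n".join(selected)
    return (
        f"Context: {context}\n\n"
        f"Question: {query}\n"
        "Answer using only the context above."
    )
```

Dropping the lowest-ranked chunks first is the simplest budget policy; summarizing them instead (context compression) keeps more information at the cost of an extra LLM call.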

Step 7: LLM Response Generation

The augmented prompt is submitted to the LLM (GPT-4, Claude, Llama, Mistral, or other models), which synthesizes information from the retrieved passages into a coherent, natural-language response.

The LLM combines its language understanding capabilities with the retrieved evidence to:

- synthesize information scattered across multiple chunks into a single coherent answer
- resolve the question's intent against the specific wording of the sources
- attribute claims to their source passages, enabling citations
- state explicitly when the retrieved context does not contain the answer

Optional post-processing may include faithfulness checking (verifying claims against retrieved sources), PII redaction, formatting adjustments, and response streaming for better user experience.

End-to-End Flow

The complete RAG process runs in milliseconds to seconds:

Documents -> Chunk -> Embed -> Store (offline, one-time)
Query -> Embed -> Search -> Retrieve -> Augment Prompt -> Generate -> Response (runtime, per query)

This architecture scales with vector database capacity, supports real-time knowledge updates by re-embedding new documents without LLM retraining, and can be extended with multimodal retrieval (images, tables), agentic reasoning loops, and graph-based knowledge structures.
