The retrieval phase is the runtime core of a Retrieval-Augmented Generation system. When a user submits a query, this phase processes the question, searches the indexed knowledge base, and assembles the most relevant context for the language model. Retrieval quality is the single most important factor determining RAG system accuracy: even the most powerful LLM cannot generate a good answer if the retrieved context is weak or irrelevant.
Before retrieval begins, the user query undergoes pre-processing to optimize search results.
The natural language query is converted into a dense vector embedding using the same embedding model applied during the ingestion phase. This ensures both queries and document chunks exist in the same semantic vector space, enabling meaningful comparison. For example, the query “How much annual leave do I have?” is transformed into a numerical vector of 768-1536 dimensions that captures its semantic meaning.
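The contract described above can be sketched in a few lines. The `embed` function below is a deterministic hashed bag-of-words stub, an assumption standing in for a real embedding model (which would produce 768-1536 semantic dimensions); the point it illustrates is only that queries and chunks must pass through the *same* function into the same fixed-dimensional space.

```python
import hashlib

DIM = 8  # toy dimensionality; production embedding models use 768-1536

def embed(text: str) -> list[float]:
    """Stub embedding: hash each token into one of DIM buckets.
    Not semantic -- a stand-in for a real embedding model, used
    identically for queries and for document chunks."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    return vec

query_vec = embed("How much annual leave do I have?")
chunk_vec = embed("Employees accrue 25 days of annual leave per year.")
assert len(query_vec) == len(chunk_vec) == DIM  # same vector space
```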
Advanced systems may pre-process queries through expansion or rewriting techniques. An LLM can rephrase ambiguous queries, add synonyms, or decompose complex multi-part questions into sub-queries for more comprehensive retrieval. Tokenization, stemming, and stop-word removal may also be applied to improve matching precision.
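The lexical side of this pre-processing is simple to sketch (LLM-based rewriting is not shown, since it is a model call). The stop-word list below is illustrative, not exhaustive:

```python
# Illustrative stop-word list; real systems use a larger, language-specific set.
STOP_WORDS = {"how", "much", "do", "i", "have", "the", "a", "an", "of", "is"}

def preprocess(query: str) -> list[str]:
    """Minimal lexical pre-processing: lowercase, tokenize, and
    drop stop words. A fuller pipeline would also stem tokens."""
    tokens = query.lower().replace("?", "").split()
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("How much annual leave do I have?"))  # ['annual', 'leave']
```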
The query embedding is compared against all stored document vectors to find the most semantically relevant chunks.
The primary similarity metric is cosine similarity, which measures the cosine of the angle between two vectors. Cosine similarity focuses on directional alignment rather than magnitude, making it well-suited for text embeddings where document length varies. Other metrics include Euclidean distance and dot product, each with trade-offs in speed and accuracy depending on the embedding model and use case.
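The metric itself is a one-liner; the check at the end shows the magnitude-invariance property the text describes:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """cos(theta) = (a . b) / (|a| * |b|): direction, not magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Scaling a vector leaves cosine similarity unchanged, which is why
# it is robust to document-length effects:
a, b = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
assert abs(cosine_similarity(a, b) - 1.0) < 1e-9
```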
For production systems with millions or billions of vectors, exact nearest neighbor search is computationally prohibitive. Approximate Nearest Neighbor (ANN) algorithms trade minor precision for dramatic speed improvements.
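One family of ANN indexes is the inverted-file (IVF) approach: vectors are bucketed by their nearest centroid, and a query probes only the closest bucket(s) instead of scanning everything. The toy sketch below (fixed centroids, one probed bucket, Euclidean distance) is an illustration of the idea under those assumptions, not a production index:

```python
import math

class ToyIVFIndex:
    """Toy inverted-file (IVF) ANN index: each vector is stored in the
    bucket of its nearest centroid; a query searches only the bucket of
    its own nearest centroid, trading some recall for far fewer
    comparisons. Real systems (e.g. FAISS-style IVF or HNSW graphs)
    are far more sophisticated."""

    def __init__(self, centroids):
        self.centroids = centroids
        self.buckets = {i: [] for i in range(len(centroids))}

    def _nearest_centroid(self, vec):
        return min(range(len(self.centroids)),
                   key=lambda i: math.dist(vec, self.centroids[i]))

    def add(self, vec):
        self.buckets[self._nearest_centroid(vec)].append(vec)

    def search(self, query, k=1):
        # Probe only one bucket -- this is the approximation.
        candidates = self.buckets[self._nearest_centroid(query)]
        return sorted(candidates, key=lambda v: math.dist(query, v))[:k]

index = ToyIVFIndex(centroids=[(0.0, 0.0), (10.0, 10.0)])
for v in [(0.5, 0.5), (1.0, 0.0), (9.0, 10.0)]:
    index.add(v)
assert index.search((9.5, 9.5), k=1) == [(9.0, 10.0)]
```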
The search returns the top-K (typically 5-20) most similar document chunks ranked by their similarity scores. The value of K balances between providing enough context for comprehensive answers and avoiding noise from marginally relevant passages.
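Selecting the top-K from scored candidates is a standard heap operation; the (chunk, score) pairs here are illustrative stand-ins for scores coming from the vector index:

```python
import heapq

def top_k(scored_chunks, k=5):
    """Keep the K chunks with the highest similarity scores.
    `scored_chunks` is a list of (chunk_text, score) pairs."""
    return heapq.nlargest(k, scored_chunks, key=lambda pair: pair[1])

scored = [("vacation policy", 0.91), ("office map", 0.12),
          ("leave accrual", 0.87), ("lunch menu", 0.05)]
assert [t for t, _ in top_k(scored, k=2)] == ["vacation policy", "leave accrual"]
```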
Initial similarity search casts a wide net. Re-ranking narrows results to the most precisely relevant chunks using more computationally expensive methods.
Unlike the bi-encoder approach used in initial retrieval (where query and document are embedded independently), cross-encoders process the query and each candidate chunk together as a single input. This joint encoding captures fine-grained interactions between query terms and document content, producing more accurate relevance scores.
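The re-ranking step can be sketched as follows. `cross_encoder_score` here is a crude word-overlap stub, an assumption standing in for a real cross-encoder model that would feed the concatenated (query, chunk) pair through one transformer; the structure of the loop is the point:

```python
def cross_encoder_score(query: str, chunk: str) -> float:
    """Stub for a real cross-encoder: here, the fraction of query
    words appearing in the chunk. A real model scores the joint
    (query, chunk) input and is far more accurate."""
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / len(q)

def rerank(query, candidates):
    """Re-score every candidate jointly with the query and sort.
    This costs one model call per candidate, which is why it is
    applied only to the small top-K list from the initial search."""
    return sorted(candidates,
                  key=lambda c: cross_encoder_score(query, c),
                  reverse=True)

q = "annual leave days"
cands = ["parking rules", "annual leave is 25 days"]
assert rerank(q, cands)[0] == "annual leave is 25 days"
```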
Chunks below a minimum relevance score (typically 0.7-0.8) are discarded to prevent low-quality context from reaching the LLM. Dynamic thresholds can adapt based on query complexity and the score distribution of retrieved results.
Hybrid retrieval addresses the limitation that neither dense (semantic) nor sparse (keyword) retrieval alone is sufficient for all query types.
Semantic vector search excels at understanding meaning, paraphrased queries, and conceptual similarity. It recognizes that “automobile maintenance” and “car repairs” are related concepts. However, it often fails on exact-match requirements like error codes, product IDs, legal clause references, and version strings.
BM25 and TF-IDF provide lexical precision through keyword matching. When a user searches for “Nginx error 502 bad gateway,” keyword search ensures exact terms like “502” and “bad gateway” are matched precisely.
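A minimal BM25 sketch over pre-tokenized documents, with the `k1` and `b` parameters at their common defaults, shows how exact terms like “502” dominate the score:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Minimal BM25: docs are lists of tokens; returns one score per doc."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            denom = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[t] * (k1 + 1) / denom
        scores.append(s)
    return scores

docs = [["nginx", "error", "502", "bad", "gateway"],
        ["nginx", "configuration", "basics"],
        ["kitchen", "cleaning", "rota"]]
scores = bm25_scores(["502", "bad", "gateway"], docs)
assert scores.index(max(scores)) == 0  # the exact-match doc wins
```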
Reciprocal Rank Fusion (RRF) is the most common method for combining dense and sparse results. It merges ranked lists from both retrieval methods without requiring score normalization, producing a unified ranking that benefits from both semantic coverage and lexical precision.
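RRF scores each document as the sum of 1 / (k + rank) over every list it appears in, with ranks starting at 1 and k = 60 as the commonly used constant; only ranks matter, so no score normalization is needed:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: score(d) = sum of 1 / (k + rank(d))."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["doc_a", "doc_b", "doc_c"]   # semantic ranking
sparse = ["doc_b", "doc_d", "doc_a"]   # keyword (BM25) ranking
fused = reciprocal_rank_fusion([dense, sparse])
assert fused[0] == "doc_b"  # ranked highly by both methods
```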
The final step assembles the top re-ranked chunks into a structured context for the LLM.
The retrieved passages are combined with the original user query and system instructions into an augmented prompt. Techniques include truncation or hierarchical summarization to fit within the LLM context window (8K-128K tokens), ordering chunks by relevance score, and deduplication of overlapping content from adjacent chunks.
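The assembly steps above (order by score, deduplicate, truncate to budget, wrap with instructions) can be sketched as follows; the character budget stands in for a token budget, and the template wording is an illustrative assumption:

```python
def assemble_prompt(query, scored_chunks, max_chars=2000):
    """Order chunks by relevance, drop exact duplicates, stop at the
    context budget, then wrap with instructions and the user query."""
    seen, parts, used = set(), [], 0
    for text, score in sorted(scored_chunks, key=lambda p: p[1], reverse=True):
        if text in seen:
            continue          # dedupe overlapping chunks
        if used + len(text) > max_chars:
            break             # respect the context budget
        seen.add(text)
        parts.append(text)
        used += len(text)
    context = "\n\n".join(parts)
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")

chunks = [("leave is 25 days", 0.9), ("carry over 5 days", 0.8),
          ("leave is 25 days", 0.9)]          # duplicate from overlap
prompt = assemble_prompt("How much annual leave?", chunks)
assert prompt.count("leave is 25 days") == 1  # deduplicated
```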
The quality of context assembly directly determines whether the LLM can produce an accurate, well-grounded response or will resort to hallucination.