The retrieval phase is the runtime core of a Retrieval-Augmented Generation system. When a user submits a query, this phase processes the question, searches the indexed knowledge base, and assembles the most relevant context for the language model. Retrieval quality is the single most important factor determining RAG system accuracy: even the most powerful LLM cannot generate a good answer if the retrieved context is weak or irrelevant.
Before retrieval begins, the user query undergoes pre-processing to optimize search results.
The natural language query is converted into a dense vector embedding using the same embedding model applied during the ingestion phase. This ensures both queries and document chunks exist in the same semantic vector space, enabling meaningful comparison. For example, the query “How much annual leave do I have?” is transformed into a numerical vector of 768-1536 dimensions that captures its semantic meaning.
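The contract described above can be sketched in a few lines. The `embed` function below is a deterministic hashed bag-of-words stub, an assumption standing in for a real embedding model (which would produce 768-1536 semantic dimensions); the point it illustrates is only that queries and chunks must pass through the *same* function into the same fixed-dimensional space.

```python
import hashlib

DIM = 8  # toy dimensionality; production embedding models use 768-1536

def embed(text: str) -> list[float]:
    """Stub embedding: hash each token into one of DIM buckets.
    Not semantic -- a stand-in for a real embedding model, used
    identically for queries and for document chunks."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    return vec

query_vec = embed("How much annual leave do I have?")
chunk_vec = embed("Employees accrue 25 days of annual leave per year.")
assert len(query_vec) == len(chunk_vec) == DIM  # same vector space
```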
Advanced systems may pre-process queries through expansion or rewriting techniques. An LLM can rephrase ambiguous queries, add synonyms, or decompose complex multi-part questions into sub-queries for more comprehensive retrieval. Tokenization, stemming, and stop-word removal may also be applied to improve matching precision.
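The lexical side of this pre-processing is simple to sketch (LLM-based rewriting is not shown, since it is a model call). The stop-word list below is illustrative, not exhaustive:

```python
# Illustrative stop-word list; real systems use a larger, language-specific set.
STOP_WORDS = {"how", "much", "do", "i", "have", "the", "a", "an", "of", "is"}

def preprocess(query: str) -> list[str]:
    """Minimal lexical pre-processing: lowercase, tokenize, and
    drop stop words. A fuller pipeline would also stem tokens."""
    tokens = query.lower().replace("?", "").split()
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("How much annual leave do I have?"))  # ['annual', 'leave']
```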
The query embedding is compared against all stored document vectors to find the most semantically relevant chunks.
The primary similarity metric is cosine similarity, which measures the cosine of the angle between two vectors. Cosine similarity focuses on directional alignment rather than magnitude, making it well-suited for text embeddings where document length varies. Other metrics include Euclidean distance and dot product, each with trade-offs in speed and accuracy depending on the embedding model and use case.
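The metric itself is a one-liner; the check at the end shows the magnitude-invariance property the text describes:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """cos(theta) = (a . b) / (|a| * |b|): direction, not magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Scaling a vector leaves cosine similarity unchanged, which is why
# it is robust to document-length effects:
a, b = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
assert abs(cosine_similarity(a, b) - 1.0) < 1e-9
```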
For production systems with millions or billions of vectors, exact nearest neighbor search is computationally prohibitive. Approximate Nearest Neighbor (ANN) algorithms trade minor precision for dramatic speed improvements.
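One family of ANN indexes is the inverted-file (IVF) approach: vectors are bucketed by their nearest centroid, and a query probes only the closest bucket(s) instead of scanning everything. The toy sketch below (fixed centroids, one probed bucket, Euclidean distance) is an illustration of the idea under those assumptions, not a production index:

```python
import math

class ToyIVFIndex:
    """Toy inverted-file (IVF) ANN index: each vector is stored in the
    bucket of its nearest centroid; a query searches only the bucket of
    its own nearest centroid, trading some recall for far fewer
    comparisons. Real systems (e.g. FAISS-style IVF or HNSW graphs)
    are far more sophisticated."""

    def __init__(self, centroids):
        self.centroids = centroids
        self.buckets = {i: [] for i in range(len(centroids))}

    def _nearest_centroid(self, vec):
        return min(range(len(self.centroids)),
                   key=lambda i: math.dist(vec, self.centroids[i]))

    def add(self, vec):
        self.buckets[self._nearest_centroid(vec)].append(vec)

    def search(self, query, k=1):
        # Probe only one bucket -- this is the approximation.
        candidates = self.buckets[self._nearest_centroid(query)]
        return sorted(candidates, key=lambda v: math.dist(query, v))[:k]

index = ToyIVFIndex(centroids=[(0.0, 0.0), (10.0, 10.0)])
for v in [(0.5, 0.5), (1.0, 0.0), (9.0, 10.0)]:
    index.add(v)
assert index.search((9.5, 9.5), k=1) == [(9.0, 10.0)]
```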
The search returns the top-K (typically 5-20) most similar document chunks ranked by their similarity scores. The value of K balances between providing enough context for comprehensive answers and avoiding noise from marginally relevant passages.
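Selecting the top-K from scored candidates is a standard heap operation; the (chunk, score) pairs here are illustrative stand-ins for scores coming from the vector index:

```python
import heapq

def top_k(scored_chunks, k=5):
    """Keep the K chunks with the highest similarity scores.
    `scored_chunks` is a list of (chunk_text, score) pairs."""
    return heapq.nlargest(k, scored_chunks, key=lambda pair: pair[1])

scored = [("vacation policy", 0.91), ("office map", 0.12),
          ("leave accrual", 0.87), ("lunch menu", 0.05)]
assert [t for t, _ in top_k(scored, k=2)] == ["vacation policy", "leave accrual"]
```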
Initial similarity search casts a wide net. Re-ranking narrows results to the most precisely relevant chunks using more computationally expensive methods.
Unlike the bi-encoder approach used in initial retrieval (where query and document are embedded independently), cross-encoders process the query and each candidate chunk together as a single input. This joint encoding captures fine-grained interactions between query terms and document content, producing more accurate relevance scores.
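The re-ranking step can be sketched as follows. `cross_encoder_score` here is a crude word-overlap stub, an assumption standing in for a real cross-encoder model that would feed the concatenated (query, chunk) pair through one transformer; the structure of the loop is the point:

```python
def cross_encoder_score(query: str, chunk: str) -> float:
    """Stub for a real cross-encoder: here, the fraction of query
    words appearing in the chunk. A real model scores the joint
    (query, chunk) input and is far more accurate."""
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / len(q)

def rerank(query, candidates):
    """Re-score every candidate jointly with the query and sort.
    This costs one model call per candidate, which is why it is
    applied only to the small top-K list from the initial search."""
    return sorted(candidates,
                  key=lambda c: cross_encoder_score(query, c),
                  reverse=True)

q = "annual leave days"
cands = ["parking rules", "annual leave is 25 days"]
assert rerank(q, cands)[0] == "annual leave is 25 days"
```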
Chunks below a minimum relevance score (typically 0.7-0.8) are discarded to prevent low-quality context from reaching the LLM. Dynamic thresholds can adapt based on query complexity and the score distribution of retrieved results.
Hybrid retrieval addresses the limitation that neither dense (semantic) nor sparse (keyword) retrieval alone is sufficient for all query types.
Semantic vector search excels at understanding meaning, paraphrased queries, and conceptual similarity. It recognizes that “automobile maintenance” and “car repairs” are related concepts. However, it often fails on exact-match requirements like error codes, product IDs, legal clause references, and version strings.
BM25 and TF-IDF provide lexical precision through keyword matching. When a user searches for “Nginx error 502 bad gateway,” keyword search ensures exact terms like “502” and “bad gateway” are matched precisely.
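A minimal BM25 sketch over pre-tokenized documents, with the `k1` and `b` parameters at their common defaults, shows how exact terms like “502” dominate the score:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Minimal BM25: docs are lists of tokens; returns one score per doc."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            denom = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[t] * (k1 + 1) / denom
        scores.append(s)
    return scores

docs = [["nginx", "error", "502", "bad", "gateway"],
        ["nginx", "configuration", "basics"],
        ["kitchen", "cleaning", "rota"]]
scores = bm25_scores(["502", "bad", "gateway"], docs)
assert scores.index(max(scores)) == 0  # the exact-match doc wins
```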
Reciprocal Rank Fusion (RRF) is the most common method for combining dense and sparse results. It merges ranked lists from both retrieval methods without requiring score normalization, producing a unified ranking that benefits from both semantic coverage and lexical precision.
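RRF scores each document as the sum of 1 / (k + rank) over every list it appears in, with ranks starting at 1 and k = 60 as the commonly used constant; only ranks matter, so no score normalization is needed:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: score(d) = sum of 1 / (k + rank(d))."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["doc_a", "doc_b", "doc_c"]   # semantic ranking
sparse = ["doc_b", "doc_d", "doc_a"]   # keyword (BM25) ranking
fused = reciprocal_rank_fusion([dense, sparse])
assert fused[0] == "doc_b"  # ranked highly by both methods
```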
The final step assembles the top re-ranked chunks into a structured context for the LLM.
The retrieved passages are combined with the original user query and system instructions into an augmented prompt. Techniques include truncation or hierarchical summarization to fit within the LLM context window (8K-128K tokens), ordering chunks by relevance score, and deduplication of overlapping content from adjacent chunks.
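The assembly steps above (order by score, deduplicate, truncate to budget, wrap with instructions) can be sketched as follows; the character budget stands in for a token budget, and the template wording is an illustrative assumption:

```python
def assemble_prompt(query, scored_chunks, max_chars=2000):
    """Order chunks by relevance, drop exact duplicates, stop at the
    context budget, then wrap with instructions and the user query."""
    seen, parts, used = set(), [], 0
    for text, score in sorted(scored_chunks, key=lambda p: p[1], reverse=True):
        if text in seen:
            continue          # dedupe overlapping chunks
        if used + len(text) > max_chars:
            break             # respect the context budget
        seen.add(text)
        parts.append(text)
        used += len(text)
    context = "\n\n".join(parts)
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")

chunks = [("leave is 25 days", 0.9), ("carry over 5 days", 0.8),
          ("leave is 25 days", 0.9)]          # duplicate from overlap
prompt = assemble_prompt("How much annual leave?", chunks)
assert prompt.count("leave is 25 days") == 1  # deduplicated
```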
The quality of context assembly directly determines whether the LLM can produce an accurate, well-grounded response or will resort to hallucination.