====== How Does the Retrieval Phase Work in RAG ======

The retrieval phase is the runtime core of a Retrieval-Augmented Generation system. When a user submits a query, this phase processes the question, searches the indexed knowledge base, and assembles the most relevant context for the language model. Retrieval quality is the single most important factor determining RAG system accuracy: even the most powerful LLM cannot generate a good answer if the retrieved context is weak or irrelevant. ((source [[https://insights.appetenza.com/hybrid-search-and-re-ranking-how-to-dramatically-improve-rag-retrieval-accuracy|Appetenza - Hybrid Search and Re-Ranking]]))

===== Query Processing =====

Before retrieval begins, the user query undergoes pre-processing to optimize search results. ((source [[https://www.aimon.ai/posts/rag_and_its_different_components/|AIMon - RAG Components]]))

==== Query Embedding ====

The natural language query is converted into a dense vector embedding using the **same embedding model** applied during the ingestion phase. This ensures that queries and document chunks occupy the same semantic vector space, enabling meaningful comparison. ((source [[https://www.aimon.ai/posts/rag_and_its_different_components/|AIMon - RAG Components]])) For example, the query "How much annual leave do I have?" is transformed into a numerical vector of 768-1536 dimensions that captures its semantic meaning. ((source [[https://aws.amazon.com/what-is/retrieval-augmented-generation/|AWS - What is RAG]]))

==== Query Expansion and Rewriting ====

Advanced systems may pre-process queries through expansion or rewriting techniques. An LLM can rephrase ambiguous queries, add synonyms, or decompose complex multi-part questions into sub-queries for more comprehensive retrieval. ((source [[https://www.meilisearch.com/blog/rag-types|Meilisearch - RAG Types]])) Tokenization, stemming, and stop-word removal may also be applied to improve matching precision.
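As a toy illustration of the lexical pre-processing step, the sketch below tokenizes a query, strips stop words, and applies a naive suffix-stripping "stem". The stop-word list and the one-rule stemmer are simplified placeholders, not a real implementation; production systems use full stop-word lists and proper stemmers (e.g. Porter/Snowball via NLTK or spaCy).

```python
import re

# Toy English stop-word list -- illustrative only; real systems use
# much fuller lists from libraries such as NLTK or spaCy.
STOP_WORDS = {"how", "much", "do", "i", "have", "the", "a", "an", "is", "of"}

def preprocess_query(query: str) -> list[str]:
    """Lowercase, tokenize, drop stop words, and apply a crude stem."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    # Crude stemming: strip a trailing 's' (stand-in for Porter/Snowball).
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in kept]

print(preprocess_query("How much annual leave do I have?"))
# -> ['annual', 'leave']
```

A real pipeline would then embed the query with the same model used at ingestion time, so query and chunks share one vector space.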
((source [[https://cloud.google.com/use-cases/retrieval-augmented-generation|Google Cloud - RAG]]))

===== Similarity Search =====

The query embedding is compared against all stored document vectors to find the most semantically relevant chunks. ((source [[https://www.aimon.ai/posts/rag_and_its_different_components/|AIMon - RAG Components]]))

==== Distance Metrics ====

The primary similarity metric is **cosine similarity**, which measures the cosine of the angle between two vectors. Cosine similarity focuses on directional alignment rather than magnitude, making it well suited for text embeddings where document length varies. ((source [[https://www.aimon.ai/posts/rag_and_its_different_components/|AIMon - RAG Components]])) Other metrics include Euclidean distance and dot product, each with trade-offs in speed and accuracy depending on the embedding model and use case.

==== Approximate Nearest Neighbor (ANN) Search ====

For production systems with millions or billions of vectors, exact nearest neighbor search is computationally prohibitive. **Approximate Nearest Neighbor** algorithms trade minor precision for dramatic speed improvements: ((source [[https://qdrant.tech/articles/what-is-rag-in-ai/|Qdrant - What is RAG in AI]]))

  * **HNSW (Hierarchical Navigable Small World)**: Builds a multi-layer graph structure where each layer provides increasingly fine-grained navigation to nearest neighbors. Offers high recall with fast query times.
  * **IVF (Inverted File Index)**: Partitions the vector space into clusters (Voronoi cells) and searches only the most relevant partitions, scaling well for massive datasets.
  * **Product Quantization**: Compresses vectors into compact codes for memory-efficient search, often combined with IVF.
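The cosine metric described above can be sketched in a few lines of plain Python. This is an exact, brute-force computation for illustration; at scale a vector database would answer the query through an ANN index (HNSW, IVF, etc.) instead of comparing every pair.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: direction, not magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Scaling a vector changes its dot product but not its cosine similarity,
# which is why cosine suits embeddings of documents of varying length.
q = [1.0, 2.0, 3.0]
d = [2.0, 4.0, 6.0]  # same direction, twice the magnitude
print(round(cosine_similarity(q, d), 6))  # -> 1.0
```

Dot product would score `d` twice as high as `q` scores itself, while cosine treats them as identical in direction; which behavior is correct depends on whether the embedding model produces normalized vectors.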
((source [[https://medium.com/@tararoutray/the-architecture-behind-vector-databases-in-modern-ai-systems-17a6c8a19095|Routray - Vector Database Architecture]]))

==== Top-K Selection ====

The search returns the top-K (typically 5-20) most similar document chunks, ranked by their similarity scores. The value of K balances providing enough context for comprehensive answers against introducing noise from marginally relevant passages. ((source [[https://www.aimon.ai/posts/rag_and_its_different_components/|AIMon - RAG Components]]))

===== Re-ranking =====

Initial similarity search casts a wide net. Re-ranking narrows the results to the most precisely relevant chunks using more computationally expensive methods. ((source [[https://insights.appetenza.com/hybrid-search-and-re-ranking-how-to-dramatically-improve-rag-retrieval-accuracy|Appetenza - Hybrid Search and Re-Ranking]]))

==== Cross-Encoder Re-ranking ====

Unlike the bi-encoder approach used in initial retrieval (where query and document are embedded independently), **cross-encoders** process the query and each candidate chunk together as a single input. This joint encoding captures fine-grained interactions between query terms and document content, producing more accurate relevance scores. ((source [[https://medium.com/@ashutoshkumars1ngh/hybrid-search-done-right-fixing-rag-retrieval-failures-using-bm25-hnsw-reciprocal-rank-fusion-a73596652d22|Ashutoshkumarsingh - Hybrid Search Done Right]]))

==== Score Thresholding ====

Chunks below a minimum relevance score (typically 0.7-0.8) are discarded to prevent low-quality context from reaching the LLM. Dynamic thresholds can adapt to query complexity and to the score distribution of the retrieved results.
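Top-K selection and score thresholding combine naturally into one small helper. This is a minimal sketch; the chunk labels, score values, and default parameters below are illustrative, and in a real system the scores would come from the re-ranker rather than be hand-written.

```python
def select_context(scored_chunks: list[tuple[str, float]],
                   k: int = 5, min_score: float = 0.7) -> list[str]:
    """Keep at most k chunks, ranked by score, dropping any below
    a fixed relevance threshold. Both k and min_score are tunable."""
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    return [text for text, score in ranked[:k] if score >= min_score]

# Illustrative (query, score) candidates from a re-ranker:
results = [("chunk A", 0.91), ("chunk B", 0.65),
           ("chunk C", 0.83), ("chunk D", 0.74)]
print(select_context(results, k=3))  # -> ['chunk A', 'chunk C', 'chunk D']
```

Note that raising `k` to 4 here changes nothing: "chunk B" still falls below the 0.7 threshold, which is exactly the safety net thresholding provides.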
((source [[https://coralogix.com/ai-blog/step-by-step-building-a-rag-chatbot-with-minor-hallucinations/|Coralogix - Building a RAG Chatbot]]))

===== Hybrid Retrieval =====

Hybrid retrieval addresses the limitation that neither dense (semantic) nor sparse (keyword) retrieval alone is sufficient for all query types. ((source [[https://medium.com/@vasanthancomrads/hybrid-search-architecture-for-rag-systems-8d5fdad4ba22|Vasanthan - Hybrid Search Architecture for RAG]]))

==== Dense Retrieval ====

Semantic vector search excels at understanding meaning, paraphrased queries, and conceptual similarity. It recognizes that "automobile maintenance" and "car repairs" are related concepts. However, it often fails on exact-match requirements such as error codes, product IDs, legal clause references, and version strings. ((source [[https://optyxstack.com/rag-reliability/hybrid-search-reranking-playbook|OptyxStack - Hybrid Search Reranking Playbook]]))

==== Sparse Retrieval ====

**BM25** and TF-IDF provide lexical precision through keyword matching. When a user searches for "Nginx error 502 bad gateway," keyword search ensures that exact terms like "502" and "bad gateway" are matched precisely. ((source [[https://insights.appetenza.com/hybrid-search-and-re-ranking-how-to-dramatically-improve-rag-retrieval-accuracy|Appetenza - Hybrid Search and Re-Ranking]]))

==== Fusion Strategies ====

**Reciprocal Rank Fusion (RRF)** is the most common method for combining dense and sparse results. It merges the ranked lists from both retrieval methods without requiring score normalization, producing a unified ranking that benefits from both semantic coverage and lexical precision. ((source [[https://medium.com/@ashutoshkumars1ngh/hybrid-search-done-right-fixing-rag-retrieval-failures-using-bm25-hnsw-reciprocal-rank-fusion-a73596652d22|Ashutoshkumarsingh - Hybrid Search Done Right]]))

===== Context Window Assembly =====

The final step assembles the top re-ranked chunks into a structured context for the LLM.
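This assembly step can be sketched as a small prompt builder. Everything here is a simplified assumption: the character budget stands in for real token counting, deduplication is exact-match only (real systems also merge near-duplicates from overlapping chunks), and the prompt wording is a made-up template rather than any particular framework's.

```python
def build_prompt(query: str, chunks: list[tuple[str, float]],
                 max_chars: int = 2000) -> str:
    """Assemble an augmented prompt: order chunks by relevance score,
    drop exact duplicates, and enforce a crude character budget
    (a stand-in for real tokenizer-based budgeting)."""
    ordered = sorted(chunks, key=lambda pair: pair[1], reverse=True)
    seen: set[str] = set()
    parts: list[str] = []
    used = 0
    for text, _score in ordered:
        if text in seen:                    # dedupe repeats from overlapping chunks
            continue
        if used + len(text) > max_chars:    # stop once the budget is exhausted
            break
        seen.add(text)
        parts.append(text)
        used += len(text)
    context = "\n---\n".join(parts)
    return (f"Answer using ONLY the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")

# Illustrative chunks; the third is a duplicate produced by chunk overlap.
chunks = [("Policy grants 25 days of annual leave.", 0.9),
          ("Leave requests go to HR.", 0.8),
          ("Policy grants 25 days of annual leave.", 0.7)]
print(build_prompt("How much annual leave do I have?", chunks))
```

The highest-scoring chunk appears first and the duplicate is emitted only once, so the LLM sees a compact, ordered context rather than raw search output.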
((source [[https://www.aimon.ai/posts/rag_and_its_different_components/|AIMon - RAG Components]]))

The retrieved passages are combined with the original user query and system instructions into an **augmented prompt**. Techniques include truncating or hierarchically summarizing chunks to fit within the LLM context window (8K-128K tokens), ordering chunks by relevance score, and deduplicating overlapping content from adjacent chunks. ((source [[https://www.aimon.ai/posts/rag_and_its_different_components/|AIMon - RAG Components]])) The quality of context assembly directly determines whether the LLM produces an accurate, well-grounded response or resorts to hallucination.

===== See Also =====

  * [[retrieval_augmented_generation|Retrieval-Augmented Generation]]
  * [[how_to_build_a_rag_pipeline|How to Build a RAG Pipeline]]
  * [[agentic_rag|Agentic RAG]]
  * [[vector_db_comparison|Vector Database Comparison]]
  * [[rag_phases|Phases of a RAG System]]
  * [[rag_ingestion_phase|What Happens During the Ingestion Phase of RAG]]
  * [[vector_database_rag|Role of a Vector Database in AI RAG Architecture]]
  * [[rag_vs_search|How Does a RAG Chatbot Improve Upon Traditional Search]]

===== References =====