Reranking is the second stage of a two-stage retrieval process: a fast initial retriever (vector similarity or BM25) fetches candidate documents, and a more computationally expensive model then reorders them by relevance to the query. This produces higher-quality context for LLMs in RAG pipelines and improves search precision. 1)
Initial retrievers compress documents into fixed-size vector embeddings, which limits their ability to capture nuanced semantic relationships between queries and documents. Rerankers use cross-attention mechanisms to jointly process query-document pairs, enabling deeper alignment assessment. This typically yields 20-50% gains in metrics like nDCG@10, MRR, or Hit Rate. 2) 3)
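The metrics named above are easy to compute directly. A minimal sketch in pure Python (toy relevance judgments; the function names are illustrative, not from any library):

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked results."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalized by the ideal (sorted-descending) ranking."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

def mrr(first_relevant_ranks):
    """Mean reciprocal rank: each entry is the 1-based rank of the first
    relevant document for a query, or None if none was retrieved."""
    return sum(0.0 if r is None else 1.0 / r for r in first_relevant_ranks) / len(first_relevant_ranks)

# Reranking moves the highly relevant doc (rel=3) from rank 4 to rank 1:
before = [0, 1, 0, 3, 0]
after  = [3, 1, 0, 0, 0]
print(ndcg_at_k(before, 10) < ndcg_at_k(after, 10))  # True: the reranked list scores higher
```

The gain claims above are statements about exactly these numbers: a reranker that lifts relevant documents toward the top raises nDCG@10 and MRR without changing the candidate set.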
Bi-encoders (used in initial retrieval) encode queries and documents separately into independent embeddings. This enables fast approximate nearest neighbor search across millions of documents but limits the interaction between query and document representations. 4)
Cross-encoders (used in reranking) jointly process query-document pairs with full cross-attention, capturing fine-grained relevance at higher computational cost. They are applied only to small candidate sets (typically top-100 to top-1000).
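The architectural difference can be made concrete with a toy sketch: a hash-based bag-of-words vector stands in for a learned embedding, and overlap weighted by query coverage stands in for a cross-attention model. All names here are hypothetical:

```python
def embed(text, dim=64):
    """Toy bi-encoder: each text becomes an independent fixed-size vector
    (bag of hashed tokens stands in for a learned embedding)."""
    vec = [0.0] * dim
    for tok in text.lower().split():
        vec[hash(tok) % dim] += 1.0
    return vec

def bi_encoder_score(query, doc):
    # Query and document never see each other: they interact only through
    # a dot product, which is what makes precomputed ANN indexes possible.
    q, d = embed(query), embed(doc)
    return sum(qi * di for qi, di in zip(q, d))

def cross_encoder_score(query, doc):
    # The pair is scored jointly. A real cross-encoder applies full
    # cross-attention over the concatenated pair; token overlap weighted
    # by query coverage stands in here.
    q_toks, d_toks = set(query.lower().split()), set(doc.lower().split())
    return len(q_toks & d_toks) / max(len(q_toks), 1)

query = "how does reranking improve precision"
print(cross_encoder_score(query, "reranking improves retrieval precision"))  # 0.4
```

The design point is the function signatures: `embed` takes one text, so document vectors can be built offline and indexed; `cross_encoder_score` takes the pair, so it can only run at query time over a small candidate set.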
Cohere Rerank v3 and v3.5 are cross-encoder models optimized for RAG applications. v3.5 improves multilingual support and reduces latency. They integrate easily into existing retrieval pipelines and are commonly used with Cohere embeddings for two-stage retrieval. 5)
BGE Reranker is an open-source cross-encoder family from the Beijing Academy of Artificial Intelligence (BAAI). It excels at dense-retrieval reranking with high accuracy on standard benchmarks, making it a strong choice for enterprise RAG systems that require self-hosting. 6)
ColBERT and ColBERTv2 are late interaction models that encode queries and documents into separate per-token embeddings and compute fine-grained token-level similarities without full cross-attention. ColBERTv2 improves efficiency through residual compression. 7)
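ColBERT's token-level scoring (MaxSim) takes only a few lines: for each query token, take the maximum similarity against all document tokens, then sum. In this sketch, raw tokens and exact-match similarity stand in for the learned per-token vectors a real model produces:

```python
def maxsim_score(query_vecs, doc_vecs, sim):
    """ColBERT-style late interaction: sum, over query token vectors, of the
    maximum similarity against any document token vector."""
    return sum(max(sim(q, d) for d in doc_vecs) for q in query_vecs)

# Toy stand-in: tokens act as their own "vectors", similarity is exact match.
def token_sim(q, d):
    return 1.0 if q == d else 0.0

query = "fast semantic reranking".split()
doc = "reranking makes semantic search precise".split()
print(maxsim_score(query, doc, token_sim))  # 2.0: 'semantic' and 'reranking' each match
```

Because the per-token document vectors are independent of the query, they can still be precomputed offline; only the cheap max/sum aggregation runs at query time, which is what places late interaction between bi-encoders and cross-encoders in cost.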
Production-focused cross-encoder for multilingual tasks with lightweight inference and strong semantic matching. Commonly used in RAG pipelines requiring domain adaptation. 8)
FlashRank provides fast, local cross-encoder reranking optimized with ONNX for low-latency inference without API calls. It is suited for real-time systems where external API latency is unacceptable. 9)
LLM-based reranking prompts an LLM (e.g., GPT-4) for pairwise or listwise scoring of candidate documents. It leverages LLM reasoning capabilities and adapts to new domains via few-shot learning, but incurs high cost and latency. 10) 11)
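A minimal sketch of the listwise variant, showing only the prompt construction and the parsing of the returned permutation; `call_llm` is a hypothetical wrapper around whatever completion API is in use:

```python
def build_listwise_prompt(query, docs):
    """Ask the LLM to emit a ranked permutation of document indices."""
    lines = [f"Query: {query}",
             "Rank the following passages by relevance to the query.",
             "Answer with indices only, most relevant first, e.g. '2 > 1 > 3'.",
             ""]
    for i, doc in enumerate(docs, 1):
        lines.append(f"[{i}] {doc}")
    return "\n".join(lines)

def parse_ranking(answer, docs):
    """Turn a '2 > 1 > 3' style answer back into an ordered document list,
    appending any indices the LLM omitted in their original order."""
    order = [int(tok) - 1 for tok in answer.replace(">", " ").split() if tok.isdigit()]
    seen = [i for i in order if 0 <= i < len(docs)]
    seen += [i for i in range(len(docs)) if i not in seen]
    return [docs[i] for i in seen]

docs = ["doc about cats", "doc about reranking", "doc about weather"]
prompt = build_listwise_prompt("what is reranking?", docs)
# ranked = parse_ranking(call_llm(prompt), docs)  # call_llm: hypothetical API wrapper
print(parse_ranking("2 > 1 > 3", docs)[0])  # 'doc about reranking'
```

The defensive parsing matters in practice: LLM output is free text, so the parser must tolerate omitted or out-of-range indices rather than crash the pipeline.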
Rerankers add 2-10x latency compared to initial retrieval (typically 50-200ms for top-100 candidates on GPU). Latency scales linearly with the number of candidates processed. 13)
Use distillation, quantization, or FlashRank for latency-sensitive applications.
Use when:
- Retrieval precision directly affects output quality (e.g., RAG answers grounded in the top few documents).
- The first stage returns large candidate sets with many marginally relevant documents.
Skip when:
- Latency budgets cannot absorb the extra 2-10x reranking cost.
- The corpus is small or the first-stage ranking is already precise enough.
The standard retrieve-then-rerank pattern:
1. Retrieve a broad candidate set (typically top-100 to top-1000) with a fast bi-encoder or BM25 index.
2. Rerank the candidates with a cross-encoder or other reranking model.
3. Pass the top few reranked documents to the LLM as context.
This pattern maximizes efficiency by using the fast first stage to narrow scope for the expensive reranking step. 15)
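The pattern can be sketched end-to-end in a few lines; toy word-overlap scorers stand in for a real ANN index and cross-encoder, and all function names are hypothetical:

```python
def cheap_score(query, doc):
    """Stage 1 stand-in: token overlap count (a real system runs ANN
    search over precomputed embeddings or a BM25 index)."""
    return len(set(query.split()) & set(doc.split()))

def expensive_score(query, doc):
    """Stage 2 stand-in: overlap weighted by query coverage (a real
    system runs a cross-encoder over each query-document pair)."""
    q = set(query.split())
    return len(q & set(doc.split())) / max(len(q), 1)

def retrieve_then_rerank(query, corpus, k=100, n=5):
    # Stage 1: fast scoring over the whole corpus, keep top-k candidates.
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)[:k]
    # Stage 2: expensive scoring of candidates only, return top-n for the LLM.
    return sorted(candidates, key=lambda d: expensive_score(query, d), reverse=True)[:n]

corpus = ["reranking reorders candidates by relevance",
          "bm25 is a lexical retrieval baseline",
          "the weather today is sunny"]
print(retrieve_then_rerank("how does reranking order candidates", corpus, k=2, n=1))
```

The cost asymmetry is the whole point: `cheap_score` runs over the full corpus, while `expensive_score` only ever sees `k` documents, so the per-query cost of the expensive model stays bounded regardless of corpus size.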