Core Concepts
Reasoning
Memory & Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools
Safety & Security
Evaluation
Meta
Agentic retrieval systems for multi-hop question answering deploy specialized LLM agents to iteratively decompose queries, filter evidence for precision, and recover missing facts for recall, dramatically outperforming single-step RAG on complex reasoning tasks.
Multi-hop question answering requires synthesizing information from multiple documents to answer questions that cannot be resolved with a single retrieval step. Standard RAG (Retrieval-Augmented Generation) retrieves passages once and feeds them to an LLM reader, but this approach often misses crucial evidence or includes distractors that degrade QA performance. PRISM introduces an agentic Precision-Recall Iterative Selection Mechanism, while MA-RAG deploys collaborative chain-of-thought agents for multi-hop reasoning.
PRISM deploys three specialized LLM agents in an iterative loop:

- Decomposer: breaks the multi-hop question into single-hop sub-questions.
- Selector: filters retrieved candidates for precision, discarding irrelevant passages.
- Adder: identifies evidence gaps and issues new retrieval queries to recover missing facts for recall.
The Selector-Adder loop typically runs 2-3 iterations, balancing precision and recall without excessively expanding context.
Fact-wise Precision and Recall: Evaluated at the sentence level:
<latex>\text{Precision} = \frac{|\text{Supporting Sentences} \cap \text{Retrieved}|}{|\text{Retrieved}|}</latex>
<latex>\text{Recall} = \frac{|\text{Supporting Sentences} \cap \text{Retrieved}|}{|\text{Gold Supporting Sentences}|}</latex>
End-to-end QA is measured by Exact Match (EM) and F1 using only the filtered evidence versus full context or gold passages.
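The fact-wise metrics above can be sketched directly from their definitions. This is a minimal illustration (the function name is hypothetical, and sentences are assumed to be compared as exact strings):

```python
def fact_precision_recall(retrieved: set[str], gold_supporting: set[str]) -> tuple[float, float]:
    """Sentence-level precision and recall of retrieved evidence."""
    hits = retrieved & gold_supporting  # supporting sentences that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(gold_supporting) if gold_supporting else 0.0
    return precision, recall
```

For example, retrieving four sentences of which two are gold-supporting (out of three gold sentences total) yields precision 0.5 and recall 2/3.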
MA-RAG extends multi-hop QA with collaborative chain-of-thought reasoning among multiple LLM agents.
The collaborative verification process reduces hallucination by requiring agent consensus on reasoning paths:
<latex>P(\text{fact correct}) = 1 - \prod_{i=1}^{K} (1 - p_i)</latex>
where $p_i$ is the confidence of agent $i$ in a fact, and $K$ agents must independently verify each claim.
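The verification probability can be computed directly from the formula above. A minimal sketch (the function name is hypothetical; agent confidences are assumed independent, as the formula requires):

```python
from math import prod


def fact_correct_probability(confidences: list[float]) -> float:
    """P(fact correct) = 1 - prod_i (1 - p_i), where p_i is agent i's confidence.

    This is the probability that at least one of the K independent agents
    correctly verifies the fact.
    """
    return 1 - prod(1 - p for p in confidences)
```

With two agents at confidences 0.8 and 0.9, the fact survives verification with probability 1 - (0.2)(0.1) = 0.98, illustrating how consensus across agents drives down the chance of an unverified hallucination.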
Agentic approaches address the key limitations of single-step RAG: a single retrieval pass misses evidence needed for later reasoning hops, and retrieved distractor passages degrade reader accuracy.
Agentic approaches reduce irrelevant tokens by 50-80% while boosting QA EM/F1, with PRISM achieving 90%+ recall versus 60-70% for single-step methods.
```python
from dataclasses import dataclass


@dataclass
class Evidence:
    text: str
    source: str
    relevance_score: float
    is_supporting: bool = False


class PRISMAgent:
    # Note: parse_sub_questions, parse_score, and parse_gap_queries are
    # LLM-output parsing helpers left unimplemented in this sketch.
    def __init__(self, llm, retriever):
        self.llm = llm
        self.retriever = retriever

    def decompose_question(self, question: str) -> list[str]:
        """Decomposer: break a multi-hop question into single-hop sub-questions."""
        sub_questions = self.llm.generate(
            f"Decompose into sub-questions for multi-hop:\n"
            f"Q: {question}\nSub-questions:"
        )
        return self.parse_sub_questions(sub_questions)

    def select_for_precision(self, sub_q: str, candidates: list[Evidence]) -> list[Evidence]:
        """Selector: keep only candidates the LLM judges relevant to the sub-question."""
        scored = []
        for c in candidates:
            relevance = self.llm.generate(
                f"Is this passage relevant to '{sub_q}'?\n"
                f"Passage: {c.text}\nAnswer (yes/no + score):"
            )
            c.relevance_score = self.parse_score(relevance)
            scored.append(c)
        return [c for c in scored if c.relevance_score > 0.5]

    def add_for_recall(self, sub_q: str, selected: list[Evidence]) -> list[Evidence]:
        """Adder: identify evidence gaps and retrieve passages to fill them."""
        gaps = self.llm.generate(
            f"What evidence is missing to answer '{sub_q}'?\n"
            f"Current evidence: {[e.text for e in selected]}"
        )
        new_queries = self.parse_gap_queries(gaps)
        additions = []
        for q in new_queries:
            results = self.retriever.search(q, top_k=3)
            additions.extend(self.select_for_precision(sub_q, results))
        return selected + additions

    def answer(self, question: str, max_iterations: int = 3) -> str:
        sub_questions = self.decompose_question(question)
        all_evidence = []
        for sub_q in sub_questions:
            candidates = self.retriever.search(sub_q, top_k=10)
            selected = self.select_for_precision(sub_q, candidates)
            # Selector-Adder loop: stop early when no new evidence is added.
            for _ in range(max_iterations):
                enriched = self.add_for_recall(sub_q, selected)
                if len(enriched) == len(selected):
                    break
                selected = enriched
            all_evidence.extend(selected)
        return self.llm.generate(
            f"Answer based on evidence:\n"
            f"Q: {question}\nEvidence: {[e.text for e in all_evidence]}"
        )
```
| Dataset | PRISM Recall | Baseline Recall | QA Result |
|---|---|---|---|
| HotpotQA | 90.9% | 61.5-72.8% | SOTA |
| 2WikiMultiHop | 91.1% | 68.1-90.7% | Competitive |
| MuSiQue | High | Lower | SOTA (surpasses full context) |
| MultiHopRAG | High | Baselines | SOTA (surpasses full context) |
PRISM reduces retrieved tokens by 50-80% while maintaining or exceeding full-context QA accuracy, demonstrating that precise evidence selection is more effective than feeding entire documents to readers.