====== Multi-Hop QA Agents ======

Agentic retrieval systems for multi-hop question answering deploy specialized LLM agents to iteratively decompose queries, filter evidence for precision, and recover missing facts for recall, dramatically outperforming single-step RAG on complex reasoning tasks.

===== Overview =====

Multi-hop question answering requires synthesizing information from multiple documents to answer questions that cannot be resolved with a single retrieval step. Standard RAG (Retrieval-Augmented Generation) retrieves passages once and feeds them to an LLM reader, but this approach often misses crucial evidence or includes distractors that degrade QA performance. PRISM(([[https://arxiv.org/abs/2510.14278|"PRISM: Precision-Recall Iterative Selection Mechanism for Agentic Multi-Hop QA." arXiv:2510.14278, 2025.]])) introduces an agentic Precision-Recall Iterative Selection Mechanism, while MA-RAG(([[https://arxiv.org/abs/2505.20096|"MA-RAG: Multi-Agent Collaborative Chain-of-Thought for Retrieval-Augmented Generation." arXiv:2505.20096, 2025.]])) deploys collaborative chain-of-thought agents for multi-hop reasoning.

===== PRISM: Agentic Retrieval for Multi-Hop QA =====

PRISM deploys three specialized LLM agents in an iterative loop:

  * **Question Analyzer Agent**: decomposes multi-hop queries into sub-questions. For example, "Which city is the capital of the country where X was born?" becomes sub-questions about X's birth country and that country's capital.
  * **Selector Agent**: reranks initial retrievals per sub-question for **precision**, filtering out distractor passages that could mislead the reader.
  * **Adder Agent**: identifies gaps in the Selector's output and retrieves **recall-focused** additions via new queries or reranking of candidate passages.

The Selector-Adder loop typically runs 2-3 iterations, balancing precision and recall without excessively expanding context.
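The decomposition step can be illustrated with a toy two-hop example in which the answer to the first sub-question (the bridging entity) is substituted into the second. Everything here (''KB'', ''answer_sub_question'', ''answer_two_hop'') is an illustrative stand-in, not part of PRISM:

```python
# Toy two-hop decomposition with a bridging entity.
# KB stands in for retrieval + reading; names are hypothetical.
KB = {
    "In which country was Marie Curie born?": "Poland",
    "What is the capital of Poland?": "Warsaw",
}


def answer_sub_question(sub_q: str) -> str:
    """Stand-in for a retrieve-then-read step: look up a toy knowledge base."""
    return KB[sub_q]


def answer_two_hop(first_hop: str, second_hop_template: str) -> str:
    bridge = answer_sub_question(first_hop)          # hop 1: bridging entity
    second_hop = second_hop_template.format(bridge)  # plug bridge into hop 2
    return answer_sub_question(second_hop)


answer = answer_two_hop(
    "In which country was Marie Curie born?",
    "What is the capital of {}?",
)
print(answer)  # Warsaw
```

A single-step retriever would have to match the original compound question directly, which is exactly the failure mode the Question Analyzer avoids.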
===== Evaluation Metrics =====

**Fact-wise Precision and Recall**: evaluated at the sentence level:

$$ \text{Precision} = \frac{|\text{Supporting Sentences} \cap \text{Retrieved}|}{|\text{Retrieved}|} $$

$$ \text{Recall} = \frac{|\text{Supporting Sentences} \cap \text{Retrieved}|}{|\text{Gold Supporting Sentences}|} $$

End-to-end QA is measured by Exact Match (EM) and F1 using only the filtered evidence versus full context or gold passages.

===== MA-RAG: Collaborative Chain-of-Thought =====

MA-RAG extends multi-hop QA with collaborative chain-of-thought reasoning among multiple LLM agents:

  * **Decomposition Agent**: breaks complex questions into reasoning chains
  * **Retrieval Agent**: fetches evidence for each reasoning step
  * **Verification Agent**: cross-checks retrieved facts across agents
  * **Synthesis Agent**: combines verified evidence into a coherent answer

The collaborative verification process reduces hallucination by requiring agent consensus on reasoning paths:

$$ P(\text{fact correct}) = 1 - \prod_{i=1}^{K} (1 - p_i) $$

where $p_i$ is the confidence of agent $i$ in a fact, and $K$ agents must independently verify each claim.

===== Beyond Single-Step RAG =====

The key limitations of single-step RAG that agentic approaches address:

  * **No decomposition**: single-step RAG cannot break complex questions into sub-queries
  * **No iteration**: one retrieval pass often misses bridging entities needed for multi-hop reasoning
  * **No filtering**: retrieved passages include distractors that mislead the reader model
  * **No verification**: no mechanism to check whether evidence actually supports the answer

Agentic approaches reduce irrelevant tokens by 50-80% while boosting QA EM/F1, with PRISM achieving 90%+ recall versus 60-70% for single-step methods.
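The fact-wise precision and recall defined earlier reduce to set operations over sentence identifiers. A minimal sketch (''fact_precision_recall'' and the ''s1''-style identifiers are illustrative):

```python
def fact_precision_recall(retrieved: set[str], gold: set[str]) -> tuple[float, float]:
    """Sentence-level precision/recall of a retrieved evidence set.

    ``retrieved`` and ``gold`` are sets of sentence identifiers;
    ``gold`` holds the annotated supporting sentences.
    """
    hits = retrieved & gold  # supporting sentences that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(gold) if gold else 0.0
    return precision, recall


# 4 retrieved sentences, 3 gold supporting sentences, 2 in common:
p, r = fact_precision_recall({"s1", "s2", "s3", "s4"}, {"s1", "s3", "s5"})
# p = 0.5 (2/4), r ≈ 0.667 (2/3)
```

A Selector pass trades recall for precision (shrinking the retrieved set), while an Adder pass does the opposite, which is why the two are iterated.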
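The MA-RAG consensus formula, $1 - \prod_{i=1}^{K}(1 - p_i)$, is the probability that at least one of $K$ independent agents verifies a fact; it can be checked numerically (''consensus_probability'' is an illustrative helper, not MA-RAG API):

```python
import math


def consensus_probability(confidences: list[float]) -> float:
    """P(fact correct) = 1 - prod(1 - p_i), assuming independent agents."""
    return 1.0 - math.prod(1.0 - p for p in confidences)


p_two = consensus_probability([0.8, 0.5])        # 1 - 0.2 * 0.5 = 0.9
p_three = consensus_probability([0.6, 0.6, 0.6])  # 1 - 0.4**3 = 0.936
```

Note that under the independence assumption, adding agents monotonically raises the consensus probability, even when each individual confidence is modest.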
===== Code Example =====

A sketch of the PRISM loop. The ''llm.generate(prompt) -> str'' and ''retriever.search(query, top_k) -> list[Evidence]'' interfaces, and the parsing helpers, are illustrative assumptions:

<code python>
import re
from dataclasses import dataclass


@dataclass
class Evidence:
    text: str
    source: str
    relevance_score: float
    is_supporting: bool = False


class PRISMAgent:
    """Precision-Recall Iterative Selection over an LLM and a retriever."""

    def __init__(self, llm, retriever):
        self.llm = llm
        self.retriever = retriever

    def decompose_question(self, question: str) -> list[str]:
        """Question Analyzer: split a multi-hop question into sub-questions."""
        sub_questions = self.llm.generate(
            f"Decompose into sub-questions for multi-hop:\n"
            f"Q: {question}\nSub-questions:"
        )
        return self.parse_sub_questions(sub_questions)

    def select_for_precision(self, sub_q: str,
                             candidates: list[Evidence]) -> list[Evidence]:
        """Selector: keep only passages judged relevant to the sub-question."""
        for c in candidates:
            relevance = self.llm.generate(
                f"Is this passage relevant to '{sub_q}'?\n"
                f"Passage: {c.text}\nAnswer (yes/no + score):"
            )
            c.relevance_score = self.parse_score(relevance)
        return [c for c in candidates if c.relevance_score > 0.5]

    def add_for_recall(self, sub_q: str,
                       selected: list[Evidence]) -> list[Evidence]:
        """Adder: retrieve evidence the current selection is still missing."""
        gaps = self.llm.generate(
            f"What evidence is missing to answer '{sub_q}'?\n"
            f"Current evidence: {[e.text for e in selected]}"
        )
        additions = []
        for q in self.parse_gap_queries(gaps):
            results = self.retriever.search(q, top_k=3)
            additions.extend(self.select_for_precision(sub_q, results))
        return selected + additions

    def answer(self, question: str, max_iterations: int = 3) -> str:
        sub_questions = self.decompose_question(question)
        all_evidence = []
        for sub_q in sub_questions:
            candidates = self.retriever.search(sub_q, top_k=10)
            selected = self.select_for_precision(sub_q, candidates)
            for _ in range(max_iterations):
                enriched = self.add_for_recall(sub_q, selected)
                if len(enriched) == len(selected):  # no gaps found: converged
                    break
                selected = enriched
            all_evidence.extend(selected)
        return self.llm.generate(
            f"Answer based on evidence:\n"
            f"Q: {question}\nEvidence: {[e.text for e in all_evidence]}"
        )

    # Parsing helpers; the exact LLM output format is implementation-specific.

    @staticmethod
    def parse_sub_questions(text: str) -> list[str]:
        return [ln.strip("- ").strip() for ln in text.splitlines() if ln.strip()]

    parse_gap_queries = parse_sub_questions  # one query per line, same format

    @staticmethod
    def parse_score(text: str) -> float:
        match = re.search(r"\d*\.?\d+", text)
        return float(match.group()) if match else 0.0
</code>

===== Architecture =====

<code>
graph TD
    A[Multi-Hop Question] --> B[Question Analyzer]
    B --> C[Sub-Question 1]
    B --> D[Sub-Question 2]
    B --> E[Sub-Question N]
    C --> F[Initial Retrieval]
    D --> F
    E --> F
    F --> G[Selector Agent - Precision]
    G --> H[Filtered Evidence Set]
    H --> I[Adder Agent - Recall]
    I --> J{Gaps Found?}
    J -->|Yes| K[New Retrieval Queries]
    K --> G
    J -->|No| L[Final Evidence Set]
    L --> M[QA Reader Model]
    M --> N[Answer]
    subgraph MA-RAG Extension
        O[Agent 1 CoT] --> P[Verification]
        Q[Agent 2 CoT] --> P
        R[Agent K CoT] --> P
        P --> S[Consensus Answer]
    end
</code>

===== Key Results =====

^ Dataset ^ PRISM Recall ^ Baseline Recall ^ PRISM QA ^
| HotpotQA | 90.9% | 61.5-72.8% | SOTA |
| 2WikiMultiHop | 91.1% | 68.1-90.7% | Competitive |
| MuSiQue | High | Lower | SOTA (surpasses full context) |
| MultiHopRAG | High | Baselines | SOTA (surpasses full context) |

PRISM reduces retrieved tokens by 50-80% while maintaining or exceeding full-context QA accuracy, demonstrating that precise evidence selection is more effective than feeding entire documents to readers.

===== See Also =====

  * [[budget_aware_reasoning|Budget-Aware Reasoning]]
  * [[software_testing_agents|Software Testing Agents]]
  * [[robotic_manipulation_agents|Robotic Manipulation Agents]]

===== References =====