====== Multi-Hop QA Agents ======
Agentic retrieval systems for multi-hop question answering deploy specialized LLM agents to iteratively decompose queries, filter evidence for precision, and recover missing facts for recall, dramatically outperforming single-step RAG on complex reasoning tasks.
===== Overview =====
Multi-hop question answering requires synthesizing information from multiple documents to answer questions that cannot be resolved with a single retrieval step. Standard RAG (Retrieval-Augmented Generation) retrieves passages once and feeds them to an LLM reader, but this approach often misses crucial evidence or includes distractors that degrade QA performance. PRISM(([[https://arxiv.org/abs/2510.14278|"PRISM: Precision-Recall Iterative Selection Mechanism for Agentic Multi-Hop QA." arXiv:2510.14278, 2025.]])) introduces an agentic Precision-Recall Iterative Selection Mechanism, while MA-RAG(([[https://arxiv.org/abs/2505.20096|"MA-RAG: Multi-Agent Collaborative Chain-of-Thought for Retrieval-Augmented Generation." arXiv:2505.20096, 2025.]])) deploys collaborative chain-of-thought agents for multi-hop reasoning.
===== PRISM: Agentic Retrieval for Multi-Hop QA =====
PRISM deploys three specialized LLM agents in an iterative loop:
* **Question Analyzer Agent**: Decomposes multi-hop queries into sub-questions. For example, "Which city is the capital of the country where X was born?" becomes sub-questions about X's birth country and that country's capital.
* **Selector Agent**: Reranks initial retrievals per sub-question for **precision**, filtering out distractor passages that could mislead the reader.
* **Adder Agent**: Identifies gaps in the Selector's output and retrieves **recall-focused** additions via new queries or reranking of candidate passages.
The Selector-Adder loop typically runs 2-3 iterations, balancing precision and recall without excessively expanding context.
**Fact-wise Precision and Recall**: Evaluated at the sentence level:
$$\text{Precision} = \frac{|\text{Supporting Sentences} \cap \text{Retrieved}|}{|\text{Retrieved}|}$$
$$\text{Recall} = \frac{|\text{Supporting Sentences} \cap \text{Retrieved}|}{|\text{Gold Supporting Sentences}|}$$
End-to-end QA is measured by Exact Match (EM) and F1 using only the filtered evidence versus full context or gold passages.
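The sentence-level metrics above can be illustrated with a small sketch that treats evidence as sets of sentence identifiers (the set representation and the identifiers are illustrative assumptions, not PRISM's exact matching procedure):

```python
def fact_wise_precision_recall(retrieved: set[str],
                               gold: set[str]) -> tuple[float, float]:
    """Sentence-level precision and recall over evidence sets.

    retrieved: sentences in the filtered evidence set
    gold: gold supporting sentences for the question
    """
    hits = retrieved & gold  # supporting sentences actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(gold) if gold else 0.0
    return precision, recall

# Example: 3 retrieved sentences, 2 of them among 4 gold-supporting ones.
p, r = fact_wise_precision_recall({"s1", "s2", "s7"},
                                  {"s1", "s2", "s3", "s4"})
# p = 2/3, r = 2/4
```

High recall with low precision means the reader sees distractors; high precision with low recall means bridging facts are missing, which is exactly the trade-off the Selector-Adder loop balances.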
===== MA-RAG: Collaborative Chain-of-Thought =====
MA-RAG extends multi-hop QA with collaborative chain-of-thought reasoning among multiple LLM agents:
* **Decomposition Agent**: Breaks complex questions into reasoning chains
* **Retrieval Agent**: Fetches evidence for each reasoning step
* **Verification Agent**: Cross-checks retrieved facts across agents
* **Synthesis Agent**: Combines verified evidence into a coherent answer
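The four-agent pipeline above can be sketched with the agents stubbed as callables (a minimal sketch under the assumption that each agent exposes a simple functional interface; the paper's agents are LLM-driven and richer than this):

```python
from typing import Callable


def ma_rag_answer(
    question: str,
    decompose: Callable[[str], list[str]],        # Decomposition Agent
    retrieve: Callable[[str], list[str]],         # Retrieval Agent
    verify: Callable[[str, str], bool],           # Verification Agent
    synthesize: Callable[[str, list[str]], str],  # Synthesis Agent
) -> str:
    """Sketch of the MA-RAG pipeline: decompose, retrieve per step,
    keep only verified passages, then synthesize an answer."""
    evidence: list[str] = []
    for step in decompose(question):       # one entry per reasoning step
        for passage in retrieve(step):     # evidence for this step
            if verify(step, passage):      # cross-check before keeping
                evidence.append(passage)
    return synthesize(question, evidence)
```

Because verification sits between retrieval and synthesis, unverified passages never reach the answer, which is the mechanism behind the hallucination reduction described next.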
The collaborative verification process reduces hallucination by requiring agent consensus on reasoning paths:
$$P(\text{fact correct}) = 1 - \prod_{i=1}^{K} (1 - p_i)$$
where $p_i$ is the confidence of agent $i$ in a fact and $K$ is the number of agents that independently verify each claim. The product is the probability that every agent misses the fact, so the expression gives the probability that at least one agent verifies it.
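Under the independence assumption, this noisy-OR-style aggregation is straightforward to compute (the confidence values below are illustrative):

```python
import math


def prob_fact_verified(confidences: list[float]) -> float:
    """P(at least one of K independent agents verifies the fact),
    i.e. 1 - prod_i (1 - p_i)."""
    miss_all = math.prod(1.0 - p for p in confidences)
    return 1.0 - miss_all


# Three agents at 70% confidence each: 1 - 0.3**3 = 0.973
prob_fact_verified([0.7, 0.7, 0.7])
```

Even moderately confident agents drive the verification probability close to 1 as $K$ grows, which is why consensus over several agents is effective.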
===== Beyond Single-Step RAG =====
The key limitations of single-step RAG that agentic approaches address:
* **No decomposition**: Single-step RAG cannot break complex questions into sub-queries
* **No iteration**: One retrieval pass often misses bridging entities needed for multi-hop reasoning
* **No filtering**: Retrieved passages include distractors that mislead the reader model
* **No verification**: No mechanism to check whether evidence actually supports the answer
Agentic approaches reduce irrelevant tokens by 50-80% while boosting QA EM/F1, with PRISM achieving 90%+ recall versus 60-70% for single-step methods.
===== Code Example =====
<code python>
from dataclasses import dataclass


@dataclass
class Evidence:
    text: str
    source: str
    relevance_score: float
    is_supporting: bool = False


class PRISMAgent:
    def __init__(self, llm, retriever):
        self.llm = llm
        self.retriever = retriever

    def decompose_question(self, question: str) -> list[str]:
        # Question Analyzer: split the multi-hop question into sub-questions
        sub_questions = self.llm.generate(
            f"Decompose into sub-questions for multi-hop:\n"
            f"Q: {question}\nSub-questions:"
        )
        return self.parse_sub_questions(sub_questions)

    def select_for_precision(self, sub_q: str,
                             candidates: list[Evidence]) -> list[Evidence]:
        # Selector: keep only passages judged relevant to the sub-question
        scored = []
        for c in candidates:
            relevance = self.llm.generate(
                f"Is this passage relevant to '{sub_q}'?\n"
                f"Passage: {c.text}\nAnswer (yes/no + score):"
            )
            c.relevance_score = self.parse_score(relevance)
            scored.append(c)
        return [c for c in scored if c.relevance_score > 0.5]

    def add_for_recall(self, sub_q: str,
                       selected: list[Evidence]) -> list[Evidence]:
        # Adder: probe for missing evidence, issue new recall-focused queries
        gaps = self.llm.generate(
            f"What evidence is missing to answer '{sub_q}'?\n"
            f"Current evidence: {[e.text for e in selected]}"
        )
        new_queries = self.parse_gap_queries(gaps)
        additions = []
        for q in new_queries:
            results = self.retriever.search(q, top_k=3)
            additions.extend(self.select_for_precision(sub_q, results))
        return selected + additions

    def answer(self, question: str, max_iterations: int = 3) -> str:
        sub_questions = self.decompose_question(question)
        all_evidence = []
        for sub_q in sub_questions:
            candidates = self.retriever.search(sub_q, top_k=10)
            selected = self.select_for_precision(sub_q, candidates)
            # Selector-Adder loop: stop once no new evidence is added
            for _ in range(max_iterations):
                enriched = self.add_for_recall(sub_q, selected)
                if len(enriched) == len(selected):
                    break
                selected = enriched
            all_evidence.extend(selected)
        return self.llm.generate(
            f"Answer based on evidence:\n"
            f"Q: {question}\nEvidence: {[e.text for e in all_evidence]}"
        )
</code>
===== Architecture =====
<code mermaid>
graph TD
    A[Multi-Hop Question] --> B[Question Analyzer]
    B --> C[Sub-Question 1]
    B --> D[Sub-Question 2]
    B --> E[Sub-Question N]
    C --> F[Initial Retrieval]
    D --> F
    E --> F
    F --> G[Selector Agent - Precision]
    G --> H[Filtered Evidence Set]
    H --> I[Adder Agent - Recall]
    I --> J{Gaps Found?}
    J -->|Yes| K[New Retrieval Queries]
    K --> G
    J -->|No| L[Final Evidence Set]
    L --> M[QA Reader Model]
    M --> N[Answer]
    subgraph MA-RAG Extension
        O[Agent 1 CoT] --> P[Verification]
        Q[Agent 2 CoT] --> P
        R[Agent K CoT] --> P
        P --> S[Consensus Answer]
    end
</code>
===== Key Results =====
^ Dataset ^ PRISM Recall ^ Baseline Recall ^ PRISM QA ^
| HotpotQA | 90.9% | 61.5-72.8% | SOTA |
| 2WikiMultiHop | 91.1% | 68.1-90.7% | Competitive |
| MuSiQue | High | Lower | SOTA (surpasses full context) |
| MultiHopRAG | High | Baselines | SOTA (surpasses full context) |
PRISM reduces retrieved tokens by 50-80% while maintaining or exceeding full-context QA accuracy, demonstrating that precise evidence selection is more effective than feeding entire documents to readers.
===== See Also =====
* [[budget_aware_reasoning|Budget-Aware Reasoning]]
* [[software_testing_agents|Software Testing Agents]]
* [[robotic_manipulation_agents|Robotic Manipulation Agents]]
===== References =====