Multi-Hop QA Agents

Agentic retrieval systems for multi-hop question answering deploy specialized LLM agents to iteratively decompose queries, filter evidence for precision, and recover missing facts for recall, dramatically outperforming single-step RAG on complex reasoning tasks.

Overview

Multi-hop question answering requires synthesizing information from multiple documents to answer questions that cannot be resolved with a single retrieval step. Standard RAG (Retrieval-Augmented Generation) retrieves passages once and feeds them to an LLM reader, but this approach often misses crucial evidence or includes distractors that degrade QA performance. PRISM1) introduces an agentic Precision-Recall Iterative Selection Mechanism, while MA-RAG2) deploys collaborative chain-of-thought agents for multi-hop reasoning.

PRISM: Agentic Retrieval for Multi-Hop QA

PRISM deploys three specialized LLM agents in an iterative loop:

- Question Analyzer: decomposes the multi-hop question into single-hop sub-questions.
- Selector Agent: filters retrieved candidates, keeping only passages relevant to each sub-question (precision).
- Adder Agent: identifies missing evidence and issues new retrieval queries to fill the gaps (recall).

The Selector-Adder loop typically runs 2-3 iterations, balancing precision and recall without excessively expanding the context.

Fact-wise Precision and Recall: Evaluated at the sentence level:

<latex>\text{Precision} = \frac{|\text{Supporting Sentences} \cap \text{Retrieved}|}{|\text{Retrieved}|}</latex>

<latex>\text{Recall} = \frac{|\text{Supporting Sentences} \cap \text{Retrieved}|}{|\text{Gold Supporting Sentences}|}</latex>

End-to-end QA is measured by Exact Match (EM) and F1 using only the filtered evidence versus full context or gold passages.
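As a quick illustration, the fact-wise metrics above reduce to set operations over sentence identifiers. The following is a minimal sketch; the sentence sets are invented for illustration:

```python
def fact_precision_recall(retrieved: set[str], gold: set[str]) -> tuple[float, float]:
    """Sentence-level precision and recall of a retrieved evidence set."""
    hits = retrieved & gold  # supporting sentences that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(gold) if gold else 0.0
    return precision, recall

retrieved = {"s1", "s2", "s3", "s4"}  # 4 retrieved sentences
gold = {"s1", "s2", "s5"}             # 3 gold supporting sentences
p, r = fact_precision_recall(retrieved, gold)
# p = 2/4 = 0.5, r = 2/3
```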

MA-RAG: Collaborative Chain-of-Thought

MA-RAG extends multi-hop QA with collaborative chain-of-thought reasoning among multiple LLM agents: each of K agents independently produces a chain-of-thought reasoning path, a verification step cross-checks the paths against one another, and an answer is emitted only when the agents reach consensus.

The collaborative verification process reduces hallucination by requiring agent consensus on reasoning paths:

<latex>P(\text{fact correct}) = 1 - \prod_{i=1}^{K} (1 - p_i)</latex>

where $p_i$ is the confidence of agent $i$ in a fact. With $K$ agents independently verifying each claim, the probability that at least one agent correctly verifies it grows rapidly with $K$.
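The verification probability above can be evaluated numerically, as in this sketch with illustrative confidence values:

```python
def verification_prob(confidences: list[float]) -> float:
    """Probability that at least one of K independent agents verifies a fact."""
    p_all_miss = 1.0
    for p_i in confidences:
        p_all_miss *= (1.0 - p_i)  # all agents so far fail to verify
    return 1.0 - p_all_miss

# Three agents, each with 70% individual confidence:
prob = verification_prob([0.7, 0.7, 0.7])
# 1 - 0.3^3 = 0.973
```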

Beyond Single-Step RAG

The key limitations of single-step RAG that agentic approaches address:

- A single retrieval pass: evidence needed for later reasoning hops cannot be found from the original question alone.
- Distractor passages: irrelevant retrieved text degrades reader accuracy.
- No recovery mechanism: facts missed by the initial retrieval stay missing, capping recall.

Agentic approaches reduce irrelevant tokens by 50-80% while boosting QA EM/F1, with PRISM achieving 90%+ recall versus 60-70% for single-step methods.

Code Example

from dataclasses import dataclass
 
@dataclass
class Evidence:
    text: str
    source: str
    relevance_score: float
    is_supporting: bool = False
 
class PRISMAgent:
    def __init__(self, llm, retriever):
        self.llm = llm
        self.retriever = retriever
 
    def decompose_question(self, question: str) -> list[str]:
        sub_questions = self.llm.generate(
            f"Decompose into sub-questions for multi-hop:\n"
            f"Q: {question}\nSub-questions:"
        )
        return self.parse_sub_questions(sub_questions)
 
    def select_for_precision(self, sub_q: str,
                             candidates: list[Evidence]) -> list[Evidence]:
        scored = []
        for c in candidates:
            relevance = self.llm.generate(
                f"Is this passage relevant to '{sub_q}'?\n"
                f"Passage: {c.text}\nAnswer (yes/no + score):"
            )
            c.relevance_score = self.parse_score(relevance)
            scored.append(c)
        return [c for c in scored if c.relevance_score > 0.5]
 
    def add_for_recall(self, sub_q: str,
                       selected: list[Evidence]) -> list[Evidence]:
        gaps = self.llm.generate(
            f"What evidence is missing to answer '{sub_q}'?\n"
            f"Current evidence: {[e.text for e in selected]}"
        )
        new_queries = self.parse_gap_queries(gaps)
        additions = []
        seen = {e.text for e in selected}
        for q in new_queries:
            results = self.retriever.search(q, top_k=3)
            for c in self.select_for_precision(sub_q, results):
                if c.text not in seen:  # avoid re-adding evidence we already have
                    seen.add(c.text)
                    additions.append(c)
        return selected + additions
 
    def answer(self, question: str, max_iterations: int = 3) -> str:
        sub_questions = self.decompose_question(question)
        all_evidence: list[Evidence] = []
        for sub_q in sub_questions:
            candidates = self.retriever.search(sub_q, top_k=10)
            selected = self.select_for_precision(sub_q, candidates)
            for _ in range(max_iterations):
                enriched = self.add_for_recall(sub_q, selected)
                if len(enriched) == len(selected):
                    break  # no new evidence found; stop iterating
                selected = enriched
            all_evidence.extend(selected)
        return self.llm.generate(
            f"Answer based on evidence:\n"
            f"Q: {question}\nEvidence: {[e.text for e in all_evidence]}"
        )

    # --- Parsing helpers (simple line/token heuristics) ---

    def parse_sub_questions(self, text: str) -> list[str]:
        # One sub-question per non-empty line of the LLM output.
        return [line.strip("- ").strip()
                for line in text.splitlines() if line.strip()]

    def parse_score(self, text: str) -> float:
        # Return the first token that parses as a float; default to 0.0.
        for token in text.split():
            try:
                return float(token.strip(".,()"))
            except ValueError:
                continue
        return 0.0

    def parse_gap_queries(self, text: str) -> list[str]:
        # One follow-up retrieval query per non-empty line.
        return [line.strip("- ").strip()
                for line in text.splitlines() if line.strip()]

Architecture

graph TD
    A[Multi-Hop Question] --> B[Question Analyzer]
    B --> C[Sub-Question 1]
    B --> D[Sub-Question 2]
    B --> E[Sub-Question N]
    C --> F[Initial Retrieval]
    D --> F
    E --> F
    F --> G[Selector Agent - Precision]
    G --> H[Filtered Evidence Set]
    H --> I[Adder Agent - Recall]
    I --> J{Gaps Found?}
    J -->|Yes| K[New Retrieval Queries]
    K --> G
    J -->|No| L[Final Evidence Set]
    L --> M[QA Reader Model]
    M --> N[Answer]
    subgraph MA-RAG Extension
        O[Agent 1 CoT] --> P[Verification]
        Q[Agent 2 CoT] --> P
        R[Agent K CoT] --> P
        P --> S[Consensus Answer]
    end

Key Results

Dataset        PRISM Recall   Baseline Recall   PRISM QA
HotpotQA       90.9%          61.5-72.8%        SOTA
2WikiMultiHop  91.1%          68.1-90.7%        Competitive
MuSiQue        High           Lower             SOTA (surpasses full context)
MultiHopRAG    High           Lower             SOTA (surpasses full context)

PRISM reduces retrieved tokens by 50-80% while maintaining or exceeding full-context QA accuracy, demonstrating that precise evidence selection is more effective than feeding entire documents to readers.

See Also

References