Core Concepts
Reasoning
Memory & Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools
Safety & Security
Evaluation
Meta
Agentic retrieval systems for multi-hop question answering deploy specialized LLM agents to iteratively decompose queries, filter evidence for precision, and recover missing facts for recall, dramatically outperforming single-step RAG on complex reasoning tasks.
Multi-hop question answering requires synthesizing information from multiple documents to answer questions that cannot be resolved with a single retrieval step. Standard RAG (Retrieval-Augmented Generation) retrieves passages once and feeds them to an LLM reader, but this approach often misses crucial evidence or includes distractors that degrade QA performance. PRISM introduces an agentic Precision-Recall Iterative Selection Mechanism, while MA-RAG deploys collaborative chain-of-thought agents for multi-hop reasoning.
PRISM deploys three specialized LLM agents in an iterative loop:

- Decomposer: breaks the multi-hop question into single-hop sub-questions.
- Selector: filters retrieved candidates for precision, discarding irrelevant passages.
- Adder: identifies evidence gaps and issues new retrieval queries to recover missing facts for recall.
The Selector-Adder loop typically runs 2-3 iterations, balancing precision and recall without excessively expanding context.
Fact-wise Precision and Recall: Evaluated at the sentence level:
<latex>\text{Precision} = \frac{|\text{Supporting Sentences} \cap \text{Retrieved}|}{|\text{Retrieved}|}</latex>
<latex>\text{Recall} = \frac{|\text{Supporting Sentences} \cap \text{Retrieved}|}{|\text{Gold Supporting Sentences}|}</latex>
End-to-end QA is measured by Exact Match (EM) and F1 using only the filtered evidence versus full context or gold passages.
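The fact-wise metrics above can be sketched directly from their definitions. This is a minimal illustration (the function name is hypothetical, and sentences are assumed to be compared as exact strings):

```python
def fact_precision_recall(retrieved: set[str], gold_supporting: set[str]) -> tuple[float, float]:
    """Sentence-level precision and recall of retrieved evidence."""
    hits = retrieved & gold_supporting  # supporting sentences that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(gold_supporting) if gold_supporting else 0.0
    return precision, recall
```

For example, retrieving four sentences of which two are gold-supporting (out of three gold sentences total) yields precision 0.5 and recall 2/3.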
MA-RAG extends multi-hop QA with collaborative chain-of-thought reasoning among multiple LLM agents.
The collaborative verification process reduces hallucination by requiring agent consensus on reasoning paths:
<latex>P(\text{fact correct}) = 1 - \prod_{i=1}^{K} (1 - p_i)</latex>
where $p_i$ is the confidence of agent $i$ in a fact, and $K$ agents must independently verify each claim.
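The verification probability can be computed directly from the formula above. A minimal sketch (the function name is hypothetical; agent confidences are assumed independent, as the formula requires):

```python
from math import prod


def fact_correct_probability(confidences: list[float]) -> float:
    """P(fact correct) = 1 - prod_i (1 - p_i), where p_i is agent i's confidence.

    This is the probability that at least one of the K independent agents
    correctly verifies the fact.
    """
    return 1 - prod(1 - p for p in confidences)
```

With two agents at confidences 0.8 and 0.9, the fact survives verification with probability 1 - (0.2)(0.1) = 0.98, illustrating how consensus across agents drives down the chance of an unverified hallucination.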
Agentic approaches address the key limitations of single-step RAG: a single retrieval pass misses evidence needed for later reasoning hops, and retrieved distractor passages degrade reader accuracy.
Agentic approaches reduce irrelevant tokens by 50-80% while boosting QA EM/F1, with PRISM achieving 90%+ recall versus 60-70% for single-step methods.
```python
from dataclasses import dataclass


@dataclass
class Evidence:
    text: str
    source: str
    relevance_score: float
    is_supporting: bool = False


class PRISMAgent:
    # Note: parse_sub_questions, parse_score, and parse_gap_queries are
    # LLM-output parsing helpers left unimplemented in this sketch.
    def __init__(self, llm, retriever):
        self.llm = llm
        self.retriever = retriever

    def decompose_question(self, question: str) -> list[str]:
        """Decomposer: break a multi-hop question into single-hop sub-questions."""
        sub_questions = self.llm.generate(
            f"Decompose into sub-questions for multi-hop:\n"
            f"Q: {question}\nSub-questions:"
        )
        return self.parse_sub_questions(sub_questions)

    def select_for_precision(self, sub_q: str, candidates: list[Evidence]) -> list[Evidence]:
        """Selector: keep only candidates the LLM judges relevant to the sub-question."""
        scored = []
        for c in candidates:
            relevance = self.llm.generate(
                f"Is this passage relevant to '{sub_q}'?\n"
                f"Passage: {c.text}\nAnswer (yes/no + score):"
            )
            c.relevance_score = self.parse_score(relevance)
            scored.append(c)
        return [c for c in scored if c.relevance_score > 0.5]

    def add_for_recall(self, sub_q: str, selected: list[Evidence]) -> list[Evidence]:
        """Adder: identify evidence gaps and retrieve passages to fill them."""
        gaps = self.llm.generate(
            f"What evidence is missing to answer '{sub_q}'?\n"
            f"Current evidence: {[e.text for e in selected]}"
        )
        new_queries = self.parse_gap_queries(gaps)
        additions = []
        for q in new_queries:
            results = self.retriever.search(q, top_k=3)
            additions.extend(self.select_for_precision(sub_q, results))
        return selected + additions

    def answer(self, question: str, max_iterations: int = 3) -> str:
        sub_questions = self.decompose_question(question)
        all_evidence = []
        for sub_q in sub_questions:
            candidates = self.retriever.search(sub_q, top_k=10)
            selected = self.select_for_precision(sub_q, candidates)
            # Selector-Adder loop: stop early when no new evidence is added.
            for _ in range(max_iterations):
                enriched = self.add_for_recall(sub_q, selected)
                if len(enriched) == len(selected):
                    break
                selected = enriched
            all_evidence.extend(selected)
        return self.llm.generate(
            f"Answer based on evidence:\n"
            f"Q: {question}\nEvidence: {[e.text for e in all_evidence]}"
        )
```
| Dataset | PRISM Recall | Baseline Recall | QA Result |
|---|---|---|---|
| HotpotQA | 90.9% | 61.5-72.8% | SOTA |
| 2WikiMultiHop | 91.1% | 68.1-90.7% | Competitive |
| MuSiQue | High | Lower | SOTA (surpasses full context) |
| MultiHopRAG | High | Baselines | SOTA (surpasses full context) |
PRISM reduces retrieved tokens by 50-80% while maintaining or exceeding full-context QA accuracy, demonstrating that precise evidence selection is more effective than feeding entire documents to readers.