Multi-Hop QA Agents

Agentic retrieval systems for multi-hop question answering deploy specialized LLM agents to iteratively decompose queries, filter evidence for precision, and recover missing facts for recall, dramatically outperforming single-step RAG on complex reasoning tasks.

Overview

Multi-hop question answering requires synthesizing information from multiple documents to answer questions that cannot be resolved with a single retrieval step. Standard RAG (Retrieval-Augmented Generation) retrieves passages once and feeds them to an LLM reader, but this approach often misses crucial evidence or includes distractors that degrade QA performance. PRISM1) introduces an agentic Precision-Recall Iterative Selection Mechanism, while MA-RAG2) deploys collaborative chain-of-thought agents for multi-hop reasoning.

PRISM: Agentic Retrieval for Multi-Hop QA

PRISM deploys three specialized LLM agents in an iterative loop:

- Question Analyzer: decomposes the multi-hop question into single-hop sub-questions.
- Selector Agent: filters retrieved candidates, keeping only passages relevant to each sub-question (precision).
- Adder Agent: identifies missing evidence and issues new retrieval queries to fill the gaps (recall).

The Selector-Adder loop typically runs 2-3 iterations, balancing precision and recall without excessively expanding the context.

Fact-wise Precision and Recall: Evaluated at the sentence level:

<latex>\text{Precision} = \frac{|\text{Supporting Sentences} \cap \text{Retrieved}|}{|\text{Retrieved}|}</latex>

<latex>\text{Recall} = \frac{|\text{Supporting Sentences} \cap \text{Retrieved}|}{|\text{Gold Supporting Sentences}|}</latex>

End-to-end QA is measured by Exact Match (EM) and F1 using only the filtered evidence versus full context or gold passages.
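As a quick illustration, the fact-wise metrics above reduce to set operations over sentence identifiers. The following is a minimal sketch; the sentence sets are invented for illustration:

```python
def fact_precision_recall(retrieved: set[str], gold: set[str]) -> tuple[float, float]:
    """Sentence-level precision and recall of a retrieved evidence set."""
    hits = retrieved & gold  # supporting sentences that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(gold) if gold else 0.0
    return precision, recall

retrieved = {"s1", "s2", "s3", "s4"}  # 4 retrieved sentences
gold = {"s1", "s2", "s5"}             # 3 gold supporting sentences
p, r = fact_precision_recall(retrieved, gold)
# p = 2/4 = 0.5, r = 2/3
```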

MA-RAG: Collaborative Chain-of-Thought

MA-RAG extends multi-hop QA with collaborative chain-of-thought reasoning among multiple LLM agents: each of K agents independently produces a chain-of-thought reasoning path, a verification step cross-checks the paths against one another, and an answer is emitted only when the agents reach consensus.

The collaborative verification process reduces hallucination by requiring agent consensus on reasoning paths:

<latex>P(\text{fact correct}) = 1 - \prod_{i=1}^{K} (1 - p_i)</latex>

where $p_i$ is the confidence of agent $i$ in a fact. With $K$ agents independently verifying each claim, the probability that at least one agent correctly verifies it grows rapidly with $K$.
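The verification probability above can be evaluated numerically, as in this sketch with illustrative confidence values:

```python
def verification_prob(confidences: list[float]) -> float:
    """Probability that at least one of K independent agents verifies a fact."""
    p_all_miss = 1.0
    for p_i in confidences:
        p_all_miss *= (1.0 - p_i)  # all agents so far fail to verify
    return 1.0 - p_all_miss

# Three agents, each with 70% individual confidence:
prob = verification_prob([0.7, 0.7, 0.7])
# 1 - 0.3^3 = 0.973
```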

Beyond Single-Step RAG

The key limitations of single-step RAG that agentic approaches address:

- A single retrieval pass: evidence needed for later reasoning hops cannot be found from the original question alone.
- Distractor passages: irrelevant retrieved text degrades reader accuracy.
- No recovery mechanism: facts missed by the initial retrieval stay missing, capping recall.

Agentic approaches reduce irrelevant tokens by 50-80% while boosting QA EM/F1, with PRISM achieving 90%+ recall versus 60-70% for single-step methods.

Code Example

from dataclasses import dataclass
 
@dataclass
class Evidence:
    text: str
    source: str
    relevance_score: float
    is_supporting: bool = False
 
class PRISMAgent:
    def __init__(self, llm, retriever):
        self.llm = llm
        self.retriever = retriever
 
    def decompose_question(self, question: str) -> list[str]:
        sub_questions = self.llm.generate(
            f"Decompose into sub-questions for multi-hop:\n"
            f"Q: {question}\nSub-questions:"
        )
        return self.parse_sub_questions(sub_questions)
 
    def select_for_precision(self, sub_q: str,
                             candidates: list[Evidence]) -> list[Evidence]:
        scored = []
        for c in candidates:
            relevance = self.llm.generate(
                f"Is this passage relevant to '{sub_q}'?\n"
                f"Passage: {c.text}\nAnswer (yes/no + score):"
            )
            c.relevance_score = self.parse_score(relevance)
            scored.append(c)
        return [c for c in scored if c.relevance_score > 0.5]
 
    def add_for_recall(self, sub_q: str,
                       selected: list[Evidence]) -> list[Evidence]:
        gaps = self.llm.generate(
            f"What evidence is missing to answer '{sub_q}'?\n"
            f"Current evidence: {[e.text for e in selected]}"
        )
        new_queries = self.parse_gap_queries(gaps)
        additions = []
        seen = {e.text for e in selected}
        for q in new_queries:
            results = self.retriever.search(q, top_k=3)
            for c in self.select_for_precision(sub_q, results):
                if c.text not in seen:  # avoid re-adding evidence we already have
                    seen.add(c.text)
                    additions.append(c)
        return selected + additions
 
    def answer(self, question: str, max_iterations: int = 3) -> str:
        sub_questions = self.decompose_question(question)
        all_evidence: list[Evidence] = []
        for sub_q in sub_questions:
            candidates = self.retriever.search(sub_q, top_k=10)
            selected = self.select_for_precision(sub_q, candidates)
            for _ in range(max_iterations):
                enriched = self.add_for_recall(sub_q, selected)
                if len(enriched) == len(selected):
                    break  # no new evidence found; stop iterating
                selected = enriched
            all_evidence.extend(selected)
        return self.llm.generate(
            f"Answer based on evidence:\n"
            f"Q: {question}\nEvidence: {[e.text for e in all_evidence]}"
        )

    # --- Parsing helpers (simple line/token heuristics) ---

    def parse_sub_questions(self, text: str) -> list[str]:
        # One sub-question per non-empty line of the LLM output.
        return [line.strip("- ").strip()
                for line in text.splitlines() if line.strip()]

    def parse_score(self, text: str) -> float:
        # Return the first token that parses as a float; default to 0.0.
        for token in text.split():
            try:
                return float(token.strip(".,()"))
            except ValueError:
                continue
        return 0.0

    def parse_gap_queries(self, text: str) -> list[str]:
        # One follow-up retrieval query per non-empty line.
        return [line.strip("- ").strip()
                for line in text.splitlines() if line.strip()]

Architecture

graph TD
    A[Multi-Hop Question] --> B[Question Analyzer]
    B --> C[Sub-Question 1]
    B --> D[Sub-Question 2]
    B --> E[Sub-Question N]
    C --> F[Initial Retrieval]
    D --> F
    E --> F
    F --> G[Selector Agent - Precision]
    G --> H[Filtered Evidence Set]
    H --> I[Adder Agent - Recall]
    I --> J{Gaps Found?}
    J -->|Yes| K[New Retrieval Queries]
    K --> G
    J -->|No| L[Final Evidence Set]
    L --> M[QA Reader Model]
    M --> N[Answer]
    subgraph MA-RAG Extension
        O[Agent 1 CoT] --> P[Verification]
        Q[Agent 2 CoT] --> P
        R[Agent K CoT] --> P
        P --> S[Consensus Answer]
    end

Key Results

Dataset        PRISM Recall   Baseline Recall   PRISM QA
HotpotQA       90.9%          61.5-72.8%        SOTA
2WikiMultiHop  91.1%          68.1-90.7%        Competitive
MuSiQue        High           Lower             SOTA (surpasses full context)
MultiHopRAG    High           Lower             SOTA (surpasses full context)

PRISM reduces retrieved tokens by 50-80% while maintaining or exceeding full-context QA accuracy, demonstrating that precise evidence selection is more effective than feeding entire documents to readers.

See Also

References