====== Multi-Hop QA Agents ======
Agentic retrieval systems for multi-hop question answering deploy specialized LLM agents to iteratively decompose queries, filter evidence for precision, and recover missing facts for recall, dramatically outperforming single-step RAG on complex reasoning tasks.
===== Overview =====
Multi-hop question answering requires synthesizing information from multiple documents to answer questions that cannot be resolved with a single retrieval step. Standard RAG (Retrieval-Augmented Generation) retrieves passages once and feeds them to an LLM reader, but this approach often misses crucial evidence or includes distractors that degrade QA performance. PRISM(([[https://arxiv.org/abs/2510.14278|"PRISM: Precision-Recall Iterative Selection Mechanism for Agentic Multi-Hop QA." arXiv:2510.14278, 2025.]])) introduces an agentic Precision-Recall Iterative Selection Mechanism, while MA-RAG(([[https://arxiv.org/abs/2505.20096|"MA-RAG: Multi-Agent Collaborative Chain-of-Thought for Retrieval-Augmented Generation." arXiv:2505.20096, 2025.]])) deploys collaborative chain-of-thought agents for multi-hop reasoning.
===== PRISM: Agentic Retrieval for Multi-Hop QA =====
PRISM deploys three specialized LLM agents in an iterative loop:
* **Question Analyzer Agent**: Decomposes multi-hop queries into sub-questions. For example, "Which city is the capital of the country where X was born?" becomes sub-questions about X's birth country and that country's capital.
* **Selector Agent**: Reranks initial retrievals per sub-question for **precision**, filtering out distractor passages that could mislead the reader.
* **Adder Agent**: Identifies gaps in the Selector's output and retrieves **recall-focused** additions via new queries or reranking of candidate passages.
The Selector-Adder loop typically runs 2-3 iterations, balancing precision and recall without excessively expanding context.
**Fact-wise Precision and Recall**: Evaluated at the sentence level:
$$\text{Precision} = \frac{|\text{Supporting Sentences} \cap \text{Retrieved}|}{|\text{Retrieved}|}$$
$$\text{Recall} = \frac{|\text{Supporting Sentences} \cap \text{Retrieved}|}{|\text{Gold Supporting Sentences}|}$$
End-to-end QA is measured by Exact Match (EM) and F1 using only the filtered evidence versus full context or gold passages.
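The sentence-level metrics above can be illustrated with a small sketch that treats evidence as sets of sentence identifiers (the set representation and the identifiers are illustrative assumptions, not PRISM's exact matching procedure):

```python
def fact_wise_precision_recall(retrieved: set[str],
                               gold: set[str]) -> tuple[float, float]:
    """Sentence-level precision and recall over evidence sets.

    retrieved: sentences in the filtered evidence set
    gold: gold supporting sentences for the question
    """
    hits = retrieved & gold  # supporting sentences actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(gold) if gold else 0.0
    return precision, recall

# Example: 3 retrieved sentences, 2 of them among 4 gold-supporting ones.
p, r = fact_wise_precision_recall({"s1", "s2", "s7"},
                                  {"s1", "s2", "s3", "s4"})
# p = 2/3, r = 2/4
```

High recall with low precision means the reader sees distractors; high precision with low recall means bridging facts are missing, which is exactly the trade-off the Selector-Adder loop balances.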
===== MA-RAG: Collaborative Chain-of-Thought =====
MA-RAG extends multi-hop QA with collaborative chain-of-thought reasoning among multiple LLM agents:
* **Decomposition Agent**: Breaks complex questions into reasoning chains
* **Retrieval Agent**: Fetches evidence for each reasoning step
* **Verification Agent**: Cross-checks retrieved facts across agents
* **Synthesis Agent**: Combines verified evidence into a coherent answer
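The four-agent pipeline above can be sketched with the agents stubbed as callables (a minimal sketch under the assumption that each agent exposes a simple functional interface; the paper's agents are LLM-driven and richer than this):

```python
from typing import Callable


def ma_rag_answer(
    question: str,
    decompose: Callable[[str], list[str]],        # Decomposition Agent
    retrieve: Callable[[str], list[str]],         # Retrieval Agent
    verify: Callable[[str, str], bool],           # Verification Agent
    synthesize: Callable[[str, list[str]], str],  # Synthesis Agent
) -> str:
    """Sketch of the MA-RAG pipeline: decompose, retrieve per step,
    keep only verified passages, then synthesize an answer."""
    evidence: list[str] = []
    for step in decompose(question):       # one entry per reasoning step
        for passage in retrieve(step):     # evidence for this step
            if verify(step, passage):      # cross-check before keeping
                evidence.append(passage)
    return synthesize(question, evidence)
```

Because verification sits between retrieval and synthesis, unverified passages never reach the answer, which is the mechanism behind the hallucination reduction described next.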
The collaborative verification process reduces hallucination by requiring agent consensus on reasoning paths:
$$P(\text{fact correct}) = 1 - \prod_{i=1}^{K} (1 - p_i)$$
where $p_i$ is the confidence of agent $i$ in a fact and $K$ is the number of agents that independently verify each claim. The product is the probability that every agent misses the fact, so the expression gives the probability that at least one agent verifies it.
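Under the independence assumption, this noisy-OR-style aggregation is straightforward to compute (the confidence values below are illustrative):

```python
import math


def prob_fact_verified(confidences: list[float]) -> float:
    """P(at least one of K independent agents verifies the fact),
    i.e. 1 - prod_i (1 - p_i)."""
    miss_all = math.prod(1.0 - p for p in confidences)
    return 1.0 - miss_all


# Three agents at 70% confidence each: 1 - 0.3**3 = 0.973
prob_fact_verified([0.7, 0.7, 0.7])
```

Even moderately confident agents drive the verification probability close to 1 as $K$ grows, which is why consensus over several agents is effective.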
===== Beyond Single-Step RAG =====
The key limitations of single-step RAG that agentic approaches address:
* **No decomposition**: Single-step RAG cannot break complex questions into sub-queries
* **No iteration**: One retrieval pass often misses bridging entities needed for multi-hop reasoning
* **No filtering**: Retrieved passages include distractors that mislead the reader model
* **No verification**: No mechanism to check whether evidence actually supports the answer
Agentic approaches reduce irrelevant tokens by 50-80% while boosting QA EM/F1, with PRISM achieving 90%+ recall versus 60-70% for single-step methods.
===== Code Example =====
<code python>
from dataclasses import dataclass


@dataclass
class Evidence:
    text: str
    source: str
    relevance_score: float
    is_supporting: bool = False


class PRISMAgent:
    def __init__(self, llm, retriever):
        self.llm = llm
        self.retriever = retriever

    def decompose_question(self, question: str) -> list[str]:
        # Question Analyzer: split the multi-hop question into sub-questions
        sub_questions = self.llm.generate(
            f"Decompose into sub-questions for multi-hop:\n"
            f"Q: {question}\nSub-questions:"
        )
        return self.parse_sub_questions(sub_questions)

    def select_for_precision(self, sub_q: str,
                             candidates: list[Evidence]) -> list[Evidence]:
        # Selector: keep only passages judged relevant to the sub-question
        scored = []
        for c in candidates:
            relevance = self.llm.generate(
                f"Is this passage relevant to '{sub_q}'?\n"
                f"Passage: {c.text}\nAnswer (yes/no + score):"
            )
            c.relevance_score = self.parse_score(relevance)
            scored.append(c)
        return [c for c in scored if c.relevance_score > 0.5]

    def add_for_recall(self, sub_q: str,
                       selected: list[Evidence]) -> list[Evidence]:
        # Adder: probe for missing evidence, issue new recall-focused queries
        gaps = self.llm.generate(
            f"What evidence is missing to answer '{sub_q}'?\n"
            f"Current evidence: {[e.text for e in selected]}"
        )
        new_queries = self.parse_gap_queries(gaps)
        additions = []
        for q in new_queries:
            results = self.retriever.search(q, top_k=3)
            additions.extend(self.select_for_precision(sub_q, results))
        return selected + additions

    def answer(self, question: str, max_iterations: int = 3) -> str:
        sub_questions = self.decompose_question(question)
        all_evidence = []
        for sub_q in sub_questions:
            candidates = self.retriever.search(sub_q, top_k=10)
            selected = self.select_for_precision(sub_q, candidates)
            # Selector-Adder loop: stop once no new evidence is added
            for _ in range(max_iterations):
                enriched = self.add_for_recall(sub_q, selected)
                if len(enriched) == len(selected):
                    break
                selected = enriched
            all_evidence.extend(selected)
        return self.llm.generate(
            f"Answer based on evidence:\n"
            f"Q: {question}\nEvidence: {[e.text for e in all_evidence]}"
        )
</code>
===== Architecture =====
<code mermaid>
graph TD
    A[Multi-Hop Question] --> B[Question Analyzer]
    B --> C[Sub-Question 1]
    B --> D[Sub-Question 2]
    B --> E[Sub-Question N]
    C --> F[Initial Retrieval]
    D --> F
    E --> F
    F --> G[Selector Agent - Precision]
    G --> H[Filtered Evidence Set]
    H --> I[Adder Agent - Recall]
    I --> J{Gaps Found?}
    J -->|Yes| K[New Retrieval Queries]
    K --> G
    J -->|No| L[Final Evidence Set]
    L --> M[QA Reader Model]
    M --> N[Answer]
    subgraph MA-RAG Extension
        O[Agent 1 CoT] --> P[Verification]
        Q[Agent 2 CoT] --> P
        R[Agent K CoT] --> P
        P --> S[Consensus Answer]
    end
</code>
===== Key Results =====
^ Dataset ^ PRISM Recall ^ Baseline Recall ^ PRISM QA ^
| HotpotQA | 90.9% | 61.5-72.8% | SOTA |
| 2WikiMultiHop | 91.1% | 68.1-90.7% | Competitive |
| MuSiQue | High | Lower | SOTA (surpasses full context) |
| MultiHopRAG | High | Baselines | SOTA (surpasses full context) |
PRISM reduces retrieved tokens by 50-80% while maintaining or exceeding full-context QA accuracy, demonstrating that precise evidence selection is more effective than feeding entire documents to readers.
===== See Also =====
* [[budget_aware_reasoning|Budget-Aware Reasoning]]
* [[software_testing_agents|Software Testing Agents]]
* [[robotic_manipulation_agents|Robotic Manipulation Agents]]
===== References =====