====== Why Is My Agent Hallucinating? ======

A practical troubleshooting guide for diagnosing and fixing hallucination in LLM-based agents. Hallucination occurs when an agent generates plausible but factually incorrect output: wrong dates, fake citations, invented API behaviors, and so on.

===== Understanding Agent Hallucination =====

Unlike simple LLM hallucination, **agent hallucination** compounds across tool calls, planning steps, and multi-turn interactions. A survey from the Chinese Academy of Sciences cataloged agent-specific hallucination taxonomies, finding that agents suffer from unique failure modes beyond base-model confabulation.(([[https://arxiv.org/html/2509.18970v1|Lin et al., "LLM-based Agents Suffer from Hallucinations: A Survey," arXiv 2025]]))

**Key statistics:**

  * Base LLMs hallucinate on at least 20% of rare-fact queries(([[https://cdn.openai.com/pdf/d04913be-3f6f-4d2b-b283-ff432ef4aaa5/why-language-models-hallucinate.pdf|OpenAI, "Why Language Models Hallucinate," 2025]]))
  * A clinical QA system showed a 63% hallucination rate without grounding, dropping to 1.7% with ontology grounding (Votek, 2025)
  * ~50% of hallucinations recur on repeated prompts; 60% resurface within 10 retries (Trends Research, 2024)
  * GPT-5-thinking-mini reduced errors from 75% to 26% via post-training, but at the cost of high refusal rates (InfoQ, 2025)

===== Root Causes =====

==== 1. Tool Result Misinterpretation ====

The agent parses tool outputs incorrectly, fabricating details from ambiguous or noisy data. A Stanford study of legal RAG tools found that agents frequently hallucinate by being unfaithful to the retrieved data.(([[https://dho.stanford.edu/wp-content/uploads/Legal_RAG_Hallucinations.pdf|Stanford Digital Humanities, "Legal RAG Hallucinations," 2024]]))

**Symptoms:**
  * Agent cites specific numbers or facts that don't appear in the tool output.
  * Confident answers that contradict the data returned.

==== 2. Context Window Overflow ====

When conversation history, tool results, and instructions exceed the token limit, critical information is truncated silently.

**Symptoms:**
  * Agent "forgets" earlier instructions.
  * Answers become increasingly incoherent in long sessions.
  * Tool results from early in the conversation are ignored.

==== 3. Ambiguous Instructions ====

Vague prompts like "find recent breakthroughs" invite the model to fill gaps with fabricated content.

**Symptoms:**
  * Agent invents specific dates, names, or URLs.
  * Responses contain plausible-sounding but unverifiable claims.

==== 4. Missing Grounding ====

Without external verification, agents rely purely on parametric knowledge, which is probabilistic by nature.

**Symptoms:**
  * Answers sound authoritative but contain subtle errors.
  * The model never says "I don't know."

==== 5. Exposure Bias (Snowball Effect) ====

Autoregressive generation means early errors cascade: each wrong token increases the probability of subsequent wrong tokens.(([[https://www.ox.ac.uk/news/2024-06-20-major-research-hallucinating-generative-models-advances-reliability-artificial|Oxford University, "Major Research on Hallucinating Generative Models," 2024]]))

**Symptoms:**
  * Responses start correctly but drift into fabrication.
  * Longer outputs are less accurate than shorter ones.

==== 6. Decoding Strategy Issues ====

High temperature or top-p settings increase randomness, making hallucination more likely. Softmax overconfidence in multi-peak distributions compounds the problem.
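The silent-truncation failure in root cause 2 can be guarded against explicitly by budgeting the context yourself. Here is a minimal sketch: the 4-characters-per-token estimate is a crude heuristic (use the model's real tokenizer, e.g. tiktoken for OpenAI models, in production), and the message-dict shape mirrors the common chat-completions format but is otherwise an assumption.

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)


def trim_history(messages: list[dict], max_tokens: int) -> list[dict]:
    """Drop the oldest non-system messages until the estimated total fits.

    Trimming explicitly beats letting the API truncate silently: you control
    what is lost and can log (or summarize) the dropped turns.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    def total(msgs: list[dict]) -> int:
        return sum(estimate_tokens(m["content"]) for m in msgs)

    while rest and total(system + rest) > max_tokens:
        rest.pop(0)  # oldest first; summarizing instead would lose less context
    return system + rest
```

Run this before every model call in long sessions, so the system prompt always survives and only the stalest turns are sacrificed.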
===== Diagnostic Flowchart =====

```mermaid
graph TD
    A[Agent producing wrong output] --> B{Is the correct info in tool results?}
    B -->|Yes| C{Does agent cite it correctly?}
    B -->|No| D[Retrieval/Tool Problem]
    C -->|Yes| E[Not hallucination - logic error]
    C -->|No| F[Tool Misinterpretation]
    D --> G{Is the data in your knowledge base?}
    G -->|Yes| H[Fix retrieval - see RAG guide]
    G -->|No| I[Add data source or ground truth]
    F --> J{Context window near limit?}
    J -->|Yes| K[Context Overflow - Compress or summarize]
    J -->|No| L{"Temperature > 0.7?"}
    L -->|Yes| M[Lower temperature to 0.1-0.3]
    L -->|No| N[Add verification chain]
    A --> O{Is output totally fabricated?}
    O -->|Yes| P{Are instructions ambiguous?}
    P -->|Yes| Q[Make instructions specific and constrained]
    P -->|No| R[Missing grounding - Add RAG or tools]
```

===== Fixes =====

==== Fix 1: RAG Grounding ====

Anchor agent responses in retrieved documents. This is the single most effective mitigation.

```python
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Ground every answer in retrieved documents
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma(persist_directory="./db", embedding_function=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

llm = ChatOpenAI(model="gpt-4o", temperature=0.1)  # Low temp reduces hallucination

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True,  # Always return sources for verification
    chain_type_kwargs={
        "prompt": PromptTemplate(
            template="""Answer based ONLY on the following context.
If the context doesn't contain the answer, say "I don't have enough information."

Context: {context}

Question: {question}

Answer:""",
            input_variables=["context", "question"],
        )
    },
)
```

==== Fix 2: Chain-of-Verification (CoVe) ====

The model drafts a response, generates verification questions, answers them independently, then produces a final verified response. Published at ACL 2024 by Meta AI and ETH Zurich (Dhuliawala et al.).(([[https://aclanthology.org/2024.findings-acl.212.pdf|Dhuliawala et al., "Chain-of-Verification Reduces Hallucination in Large Language Models," ACL Findings 2024]]))

```python
def chain_of_verification(query: str, initial_answer: str, client) -> str:
    """Implement Chain-of-Verification to reduce hallucination."""
    # Step 1: Generate verification questions
    verification_prompt = f"""Given this answer to the question "{query}":

Answer: {initial_answer}

Generate 3-5 specific factual claims that can be independently verified.
Format each as a yes/no verification question."""
    verification_resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.0,
        messages=[{"role": "user", "content": verification_prompt}],
    )
    questions = verification_resp.choices[0].message.content

    # Step 2: Answer each verification question independently
    verify_prompt = f"""Answer each question independently with YES, NO, or UNCERTAIN.
Do NOT refer to any previous answer. Use only your knowledge.

{questions}"""
    verify_resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.0,
        messages=[{"role": "user", "content": verify_prompt}],
    )
    verifications = verify_resp.choices[0].message.content

    # Step 3: Generate corrected final answer
    final_prompt = f"""Original question: {query}

Draft answer: {initial_answer}

Verification results: {verifications}

Produce a corrected final answer. Remove any claims that failed verification.
If uncertain, state what is uncertain."""
    final_resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.0,
        messages=[{"role": "user", "content": final_prompt}],
    )
    return final_resp.choices[0].message.content
```

==== Fix 3: Self-Consistency (Sample and Vote) ====

Generate multiple responses and select the majority answer. Effective for reasoning tasks.

```python
def self_consistency_check(query: str, client, n_samples: int = 5) -> str:
    """Generate multiple answers and return the most consistent one."""
    answers = []
    for _ in range(n_samples):
        resp = client.chat.completions.create(
            model="gpt-4o",
            temperature=0.7,  # Need variance for diversity
            messages=[{"role": "user", "content": query}],
        )
        answers.append(resp.choices[0].message.content)

    # Use an LLM to cluster similar answers and pick the majority
    numbered = "\n".join(f"{i + 1}. {a}" for i, a in enumerate(answers))
    cluster_prompt = f"""Given these {n_samples} answers to "{query}":

{numbered}

Group similar answers. Return the answer that appears most frequently.
If answers disagree on facts, flag the disagreement."""
    result = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.0,
        messages=[{"role": "user", "content": cluster_prompt}],
    )
    return result.choices[0].message.content
```

==== Fix 4: Temperature Tuning ====

Use a lower temperature (0.0-0.3) for factual tasks; higher temperature increases hallucination risk.

  * **Factual Q&A:** temperature=0.0 to 0.1
  * **Structured output:** temperature=0.0
  * **Creative writing:** temperature=0.7 to 1.0 (hallucination acceptable)

==== Fix 5: Constrained Decoding ====

Restrict output to valid tokens using JSON schemas, regex patterns, or grammar constraints.
```python
from pydantic import BaseModel
from openai import OpenAI


class VerifiedAnswer(BaseModel):
    answer: str
    confidence: float  # 0.0 to 1.0
    sources: list[str]
    caveats: list[str]


client = OpenAI()
response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is the population of Tokyo?"}],
    response_format=VerifiedAnswer,
    temperature=0.0,
)
# The model is forced to populate the confidence and caveats fields;
# a low confidence value flags a likely hallucination.
```

===== Hallucination Detection Code =====

```python
import numpy as np
from sentence_transformers import SentenceTransformer


class HallucinationDetector:
    """Detect potential hallucination by comparing agent output
    against source documents."""

    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)
        self.threshold = 0.3  # Below this = likely hallucination

    def check(self, agent_output: str, source_documents: list[str]) -> dict:
        """Compare agent output sentences against source docs."""
        # Split output into individual claims
        claims = [s.strip() for s in agent_output.split(".") if len(s.strip()) > 10]
        source_text = " ".join(source_documents)
        source_embedding = self.model.encode([source_text])

        results = []
        for claim in claims:
            claim_embedding = self.model.encode([claim])
            # Cosine similarity between the claim and the source text
            similarity = np.dot(claim_embedding[0], source_embedding[0]) / (
                np.linalg.norm(claim_embedding[0]) * np.linalg.norm(source_embedding[0])
            )
            results.append({
                "claim": claim,
                "similarity": float(similarity),
                "likely_hallucinated": similarity < self.threshold,
            })

        hallucinated = [r for r in results if r["likely_hallucinated"]]
        return {
            "total_claims": len(results),
            "hallucinated_claims": len(hallucinated),
            "hallucination_rate": len(hallucinated) / max(len(results), 1),
            "details": results,
        }


# Usage
detector = HallucinationDetector()
result = detector.check(
    agent_output="Tokyo has a population of 14 million. It was founded in 1457.",
    source_documents=["Tokyo, population 13.96 million, is the capital of Japan."],
)
print(f"Hallucination rate: {result['hallucination_rate']:.0%}")
```

===== See Also =====

  * [[why_is_my_rag_returning_bad_results|Why Is My RAG Returning Bad Results?]]
  * [[common_agent_failure_modes|Common Agent Failure Modes]]
  * [[how_to_handle_rate_limits|How to Handle Rate Limits]]

===== References =====