====== Why Is My Agent Hallucinating? ======
A practical troubleshooting guide for diagnosing and fixing hallucination in LLM-based agents. Hallucination occurs when an agent generates plausible but factually incorrect outputs — from wrong dates and fake citations to invented API behaviors.
===== Understanding Agent Hallucination =====
Unlike simple LLM hallucination, **agent hallucination** compounds across tool calls, planning steps, and multi-turn interactions. A 2025 survey from the Chinese Academy of Sciences cataloged agent-specific hallucination taxonomies, finding that agents suffer from unique failure modes beyond base model confabulation.(([[https://arxiv.org/html/2509.18970v1|Lin et al., "LLM-based Agents Suffer from Hallucinations: A Survey," arXiv 2025]]))
**Key statistics:**
* Base LLMs hallucinate at least 20% on rare facts(([[https://cdn.openai.com/pdf/d04913be-3f6f-4d2b-b283-ff432ef4aaa5/why-language-models-hallucinate.pdf|OpenAI, "Why Language Models Hallucinate," 2025]]))
* Clinical QA systems showed 63% hallucination rate without grounding, dropping to 1.7% with ontology grounding (Votek, 2025)
* ~50% of hallucinations recur on repeated prompts; 60% resurface within 10 retries (Trends Research, 2024)
* GPT-5-thinking-mini reduced errors from 75% to 26% via post-training, but at the cost of high refusal rates (InfoQ, 2025)
===== Root Causes =====
==== 1. Tool Result Misinterpretation ====
Agents can parse tool outputs incorrectly, fabricating details from ambiguous or noisy data. A Stanford study of legal RAG tools found that even retrieval-grounded systems hallucinate frequently, often by misstating the documents they retrieve.(([[https://dho.stanford.edu/wp-content/uploads/Legal_RAG_Hallucinations.pdf|Stanford Digital Humanities, "Legal RAG Hallucinations," 2024]]))
**Symptoms:** Agent cites specific numbers or facts that don't appear in tool output. Confident answers that contradict the data returned.
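One cheap guardrail for this failure mode is to check whether the specific numbers an agent cites actually appear in the tool output it claims to summarize. Below is a minimal sketch; the function name and regex are illustrative, and a production check would need entailment or NLI models rather than string matching:

```python
import re

def unsupported_numbers(agent_answer: str, tool_output: str) -> list[str]:
    """Return numeric claims in the answer that never appear in the tool output.

    A crude faithfulness check: if the agent cites "42 million" but the tool
    returned "41 million", the fabricated figure is flagged here.
    """
    answer_nums = set(re.findall(r"\d+(?:\.\d+)?", agent_answer))
    tool_nums = set(re.findall(r"\d+(?:\.\d+)?", tool_output))
    return sorted(answer_nums - tool_nums)

# Any number in the answer that the tool never returned is suspect
print(unsupported_numbers(
    "Revenue was 42 million in 2023.",
    "Q3 report (2023): revenue 41 million."
))  # → ['42']
```

Running this check after every tool-dependent answer turns "confident answers that contradict the data" from a silent failure into a logged one.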
==== 2. Context Window Overflow ====
When conversation history, tool results, and instructions exceed the token limit, critical information gets truncated silently.
**Symptoms:** Agent "forgets" earlier instructions. Answers become increasingly incoherent in long sessions. Tool results from early in the conversation are ignored.
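A quick way to catch silent truncation is to budget tokens before every model call. The sketch below uses a rough characters-per-token heuristic (about 4 characters per token for English) so it needs no tokenizer dependency; the limit and reserve values are illustrative, not any specific model's:

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    return max(1, len(text) // 4)

def fits_in_context(messages: list[dict], limit: int = 128_000,
                    reserve: int = 4_096) -> bool:
    """Return True if the conversation fits the context window,
    leaving `reserve` tokens for the model's response."""
    used = sum(estimate_tokens(m["content"]) for m in messages)
    return used + reserve <= limit
```

When this returns False, summarize or drop the oldest turns explicitly instead of letting the runtime truncate for you.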
==== 3. Ambiguous Instructions ====
Vague prompts like "find recent breakthroughs" invite the model to fill gaps with fabricated content.
**Symptoms:** Agent invents specific dates, names, or URLs. Responses contain plausible-sounding but unverifiable claims.
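As an illustration, here is the same request rewritten from gap-inviting to constrained. The exact wording is hypothetical; the point is that the rewrite adds a scope, a source, and an explicit fallback:

```python
vague = "Find recent breakthroughs in battery technology."

constrained = (
    "From the attached search results ONLY, list battery-technology papers "
    "published in 2024. For each, quote the exact title and URL as they "
    "appear in the results. If no results match, reply exactly: "
    "'No matching papers found.'"
)
```

The constrained version leaves the model no gaps to fill: it must quote, not recall, and it has a sanctioned way to say nothing.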
==== 4. Missing Grounding ====
Without external verification, agents rely purely on parametric knowledge, which is probabilistic by nature.
**Symptoms:** Answers sound authoritative but contain subtle errors. Model never says "I don't know."
==== 5. Exposure Bias (Snowball Effect) ====
Autoregressive generation means early errors cascade — each wrong token increases the probability of subsequent wrong tokens.(([[https://www.ox.ac.uk/news/2024-06-20-major-research-hallucinating-generative-models-advances-reliability-artificial|Oxford University, "Major Research on Hallucinating Generative Models," 2024]]))
**Symptoms:** Responses start correctly but drift into fabrication. Longer outputs are less accurate than shorter ones.
==== 6. Decoding Strategy Issues ====
High temperature or top-p settings increase randomness, making hallucination more likely. Softmax overconfidence in multi-peak distributions compounds the problem.
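The softmax effect is easy to see directly: when two candidate tokens have near-tied logits, sampling at high temperature keeps substantial probability on the runner-up, while low temperature concentrates mass on the top choice. A self-contained illustration:

```python
import math

def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Convert logits to probabilities at a given sampling temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Two near-tied candidates: a "multi-peak" distribution
logits = [2.0, 1.9, 0.1]
p_hot = softmax(logits, temperature=1.0)   # runner-up keeps ~44% of the mass
p_cold = softmax(logits, temperature=0.1)  # top choice takes ~73%
```

Lowering temperature does not make the model's knowledge better, but it does stop sampling from promoting a nearly-as-likely wrong token.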
===== Diagnostic Flowchart =====
<code>
graph TD
    A[Agent producing wrong output] --> B{Is the correct info in tool results?}
    B -->|Yes| C{Does agent cite it correctly?}
    B -->|No| D[Retrieval/Tool Problem]
    C -->|Yes| E[Not hallucination - logic error]
    C -->|No| F[Tool Misinterpretation]
    D --> G{Is the data in your knowledge base?}
    G -->|Yes| H[Fix retrieval - see RAG guide]
    G -->|No| I[Add data source or ground truth]
    F --> J{Context window near limit?}
    J -->|Yes| K[Context Overflow - Compress or summarize]
    J -->|No| L{Temperature > 0.7?}
    L -->|Yes| M[Lower temperature to 0.1-0.3]
    L -->|No| N[Add verification chain]
    A --> O{Is output totally fabricated?}
    O -->|Yes| P{Are instructions ambiguous?}
    P -->|Yes| Q[Make instructions specific and constrained]
    P -->|No| R[Missing grounding - Add RAG or tools]
</code>
===== Fixes =====
==== Fix 1: RAG Grounding ====
Anchor agent responses in retrieved documents. For factual tasks this is often the most effective single mitigation.
<code python>
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Ground every answer in retrieved documents
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma(persist_directory="./db", embedding_function=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
llm = ChatOpenAI(model="gpt-4o", temperature=0.1)  # Low temp reduces hallucination

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True,  # Always return sources for verification
    chain_type_kwargs={
        "prompt": PromptTemplate(
            template="""Answer based ONLY on the following context.
If the context doesn't contain the answer, say "I don't have enough information."

Context: {context}

Question: {question}

Answer:""",
            input_variables=["context", "question"]
        )
    }
)
</code>
==== Fix 2: Chain-of-Verification (CoVe) ====
The model drafts a response, generates verification questions, answers them independently, then produces a final verified response. Published in Findings of ACL 2024 by researchers at Meta AI and ETH Zurich (Dhuliawala et al.).(([[https://aclanthology.org/2024.findings-acl.212.pdf|Dhuliawala et al., "Chain-of-Verification Reduces Hallucination in Large Language Models," ACL Findings 2024]]))
<code python>
def chain_of_verification(query: str, initial_answer: str, client) -> str:
    """Implement Chain-of-Verification to reduce hallucination."""
    # Step 1: Generate verification questions
    verification_prompt = f"""Given this answer to the question "{query}":
Answer: {initial_answer}
Generate 3-5 specific factual claims that can be independently verified.
Format each as a yes/no verification question."""
    verification_resp = client.chat.completions.create(
        model="gpt-4o", temperature=0.0,
        messages=[{"role": "user", "content": verification_prompt}]
    )
    questions = verification_resp.choices[0].message.content

    # Step 2: Answer each verification question independently
    verify_prompt = f"""Answer each question independently with YES, NO, or UNCERTAIN.
Do NOT refer to any previous answer. Use only your knowledge.
{questions}"""
    verify_resp = client.chat.completions.create(
        model="gpt-4o", temperature=0.0,
        messages=[{"role": "user", "content": verify_prompt}]
    )
    verifications = verify_resp.choices[0].message.content

    # Step 3: Generate corrected final answer
    final_prompt = f"""Original question: {query}
Draft answer: {initial_answer}
Verification results: {verifications}
Produce a corrected final answer. Remove any claims that failed verification.
If uncertain, state what is uncertain."""
    final_resp = client.chat.completions.create(
        model="gpt-4o", temperature=0.0,
        messages=[{"role": "user", "content": final_prompt}]
    )
    return final_resp.choices[0].message.content
</code>
==== Fix 3: Self-Consistency (Sample and Vote) ====
Generate multiple responses and select the majority answer. Effective for reasoning tasks.
<code python>
def self_consistency_check(query: str, client, n_samples: int = 5) -> str:
    """Generate multiple answers and return the most consistent one."""
    answers = []
    for _ in range(n_samples):
        resp = client.chat.completions.create(
            model="gpt-4o", temperature=0.7,  # Need variance for diversity
            messages=[{"role": "user", "content": query}]
        )
        answers.append(resp.choices[0].message.content)

    # Use an LLM to cluster similar answers and pick the majority
    numbered = "\n".join(f"{i + 1}. {a}" for i, a in enumerate(answers))
    cluster_prompt = f"""Given these {n_samples} answers to "{query}":
{numbered}
Group similar answers. Return the answer that appears most frequently.
If answers disagree on facts, flag the disagreement."""
    result = client.chat.completions.create(
        model="gpt-4o", temperature=0.0,
        messages=[{"role": "user", "content": cluster_prompt}]
    )
    return result.choices[0].message.content
</code>
==== Fix 4: Temperature Tuning ====
Lower temperature (0.0-0.3) for factual tasks. Higher temperature increases hallucination risk.
* **Factual Q&A:** temperature=0.0 to 0.1
* **Structured output:** temperature=0.0
* **Creative writing:** temperature=0.7 to 1.0 (hallucination acceptable)
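The guidance above can be captured as a small lookup so an agent picks a decoding temperature per task type. The values come from the list; the key names and the conservative default for unknown tasks are assumptions for illustration:

```python
# Temperatures from the guidance above; keys are illustrative task labels.
TASK_TEMPERATURE = {
    "factual_qa": 0.0,
    "structured_output": 0.0,
    "creative_writing": 0.8,
}

def temperature_for(task: str) -> float:
    # Unknown tasks default low: prefer the less hallucination-prone setting.
    return TASK_TEMPERATURE.get(task, 0.1)
```

Defaulting low means a mislabeled task degrades toward cautious output rather than toward fabrication.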
==== Fix 5: Constrained Decoding ====
Restrict output to valid tokens using JSON schemas, regex patterns, or grammar constraints.
<code python>
from pydantic import BaseModel
from openai import OpenAI

class VerifiedAnswer(BaseModel):
    answer: str
    confidence: float  # 0.0 to 1.0
    sources: list[str]
    caveats: list[str]

client = OpenAI()
response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is the population of Tokyo?"}],
    response_format=VerifiedAnswer,
    temperature=0.0
)
# The model is forced to populate the confidence and caveats fields;
# low confidence flags likely hallucination.
</code>
===== Hallucination Detection Code =====
<code python>
import numpy as np
from sentence_transformers import SentenceTransformer

class HallucinationDetector:
    """Detect potential hallucination by comparing agent output against source documents."""

    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)
        self.threshold = 0.3  # Below this = likely hallucination

    def check(self, agent_output: str, source_documents: list[str]) -> dict:
        """Compare agent output sentences against source docs."""
        # Split output into individual claims
        claims = [s.strip() for s in agent_output.split('.') if len(s.strip()) > 10]
        source_text = " ".join(source_documents)
        source_embedding = self.model.encode([source_text])

        results = []
        for claim in claims:
            claim_embedding = self.model.encode([claim])
            similarity = np.dot(claim_embedding[0], source_embedding[0]) / (
                np.linalg.norm(claim_embedding[0]) * np.linalg.norm(source_embedding[0])
            )
            results.append({
                "claim": claim,
                "similarity": float(similarity),
                "likely_hallucinated": similarity < self.threshold
            })

        hallucinated = [r for r in results if r["likely_hallucinated"]]
        return {
            "total_claims": len(results),
            "hallucinated_claims": len(hallucinated),
            "hallucination_rate": len(hallucinated) / max(len(results), 1),
            "details": results
        }

# Usage
detector = HallucinationDetector()
result = detector.check(
    agent_output="Tokyo has a population of 14 million. It was founded in 1457.",
    source_documents=["Tokyo, population 13.96 million, is the capital of Japan."]
)
print(f"Hallucination rate: {result['hallucination_rate']:.0%}")
</code>
===== See Also =====
* [[why_is_my_rag_returning_bad_results|Why Is My RAG Returning Bad Results?]]
* [[common_agent_failure_modes|Common Agent Failure Modes]]
* [[how_to_handle_rate_limits|How to Handle Rate Limits]]
===== References =====