====== Why Is My Agent Hallucinating? ======

A practical troubleshooting guide for diagnosing and fixing hallucination in LLM-based agents. Hallucination occurs when an agent generates plausible but factually incorrect output: wrong dates, fake citations, invented API behaviors, and so on.

===== Understanding Agent Hallucination =====

Unlike simple LLM hallucination, **agent hallucination** compounds across tool calls, planning steps, and multi-turn interactions. A survey from the Chinese Academy of Sciences cataloged agent-specific hallucination taxonomies, finding that agents suffer from unique failure modes beyond base-model confabulation.(([[https://arxiv.org/html/2509.18970v1|Lin et al., "LLM-based Agents Suffer from Hallucinations: A Survey," arXiv 2025]]))

**Key statistics:**

  * Base LLMs hallucinate on at least 20% of rare-fact queries(([[https://cdn.openai.com/pdf/d04913be-3f6f-4d2b-b283-ff432ef4aaa5/why-language-models-hallucinate.pdf|OpenAI, "Why Language Models Hallucinate," 2025]]))
  * A clinical QA system showed a 63% hallucination rate without grounding, dropping to 1.7% with ontology grounding (Votek, 2025)
  * ~50% of hallucinations recur on repeated prompts; 60% resurface within 10 retries (Trends Research, 2024)
  * GPT-5-thinking-mini reduced errors from 75% to 26% via post-training, but at the cost of high refusal rates (InfoQ, 2025)

===== Root Causes =====

==== 1. Tool Result Misinterpretation ====

The agent parses tool outputs incorrectly, fabricating details from ambiguous or noisy data. A Stanford study of legal RAG tools found that agents frequently hallucinate by being unfaithful to the retrieved data.(([[https://dho.stanford.edu/wp-content/uploads/Legal_RAG_Hallucinations.pdf|Stanford Digital Humanities, "Legal RAG Hallucinations," 2024]]))

**Symptoms:**
  * Agent cites specific numbers or facts that don't appear in the tool output.
  * Confident answers that contradict the data returned.

==== 2. Context Window Overflow ====

When conversation history, tool results, and instructions exceed the token limit, critical information is truncated silently.

**Symptoms:**
  * Agent "forgets" earlier instructions.
  * Answers become increasingly incoherent in long sessions.
  * Tool results from early in the conversation are ignored.

==== 3. Ambiguous Instructions ====

Vague prompts like "find recent breakthroughs" invite the model to fill gaps with fabricated content.

**Symptoms:**
  * Agent invents specific dates, names, or URLs.
  * Responses contain plausible-sounding but unverifiable claims.

==== 4. Missing Grounding ====

Without external verification, agents rely purely on parametric knowledge, which is probabilistic by nature.

**Symptoms:**
  * Answers sound authoritative but contain subtle errors.
  * The model never says "I don't know."

==== 5. Exposure Bias (Snowball Effect) ====

Autoregressive generation means early errors cascade: each wrong token increases the probability of subsequent wrong tokens.(([[https://www.ox.ac.uk/news/2024-06-20-major-research-hallucinating-generative-models-advances-reliability-artificial|Oxford University, "Major Research on Hallucinating Generative Models," 2024]]))

**Symptoms:**
  * Responses start correctly but drift into fabrication.
  * Longer outputs are less accurate than shorter ones.

==== 6. Decoding Strategy Issues ====

High temperature or top-p settings increase randomness, making hallucination more likely. Softmax overconfidence in multi-peak distributions compounds the problem.
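The silent-truncation failure in root cause 2 can be guarded against explicitly by budgeting the context yourself. Here is a minimal sketch: the 4-characters-per-token estimate is a crude heuristic (use the model's real tokenizer, e.g. tiktoken for OpenAI models, in production), and the message-dict shape mirrors the common chat-completions format but is otherwise an assumption.

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)


def trim_history(messages: list[dict], max_tokens: int) -> list[dict]:
    """Drop the oldest non-system messages until the estimated total fits.

    Trimming explicitly beats letting the API truncate silently: you control
    what is lost and can log (or summarize) the dropped turns.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    def total(msgs: list[dict]) -> int:
        return sum(estimate_tokens(m["content"]) for m in msgs)

    while rest and total(system + rest) > max_tokens:
        rest.pop(0)  # oldest first; summarizing instead would lose less context
    return system + rest
```

Run this before every model call in long sessions, so the system prompt always survives and only the stalest turns are sacrificed.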
===== Diagnostic Flowchart =====

```mermaid
graph TD
    A[Agent producing wrong output] --> B{Is the correct info in tool results?}
    B -->|Yes| C{Does agent cite it correctly?}
    B -->|No| D[Retrieval/Tool Problem]
    C -->|Yes| E[Not hallucination - logic error]
    C -->|No| F[Tool Misinterpretation]
    D --> G{Is the data in your knowledge base?}
    G -->|Yes| H[Fix retrieval - see RAG guide]
    G -->|No| I[Add data source or ground truth]
    F --> J{Context window near limit?}
    J -->|Yes| K[Context Overflow - Compress or summarize]
    J -->|No| L{"Temperature > 0.7?"}
    L -->|Yes| M[Lower temperature to 0.1-0.3]
    L -->|No| N[Add verification chain]
    A --> O{Is output totally fabricated?}
    O -->|Yes| P{Are instructions ambiguous?}
    P -->|Yes| Q[Make instructions specific and constrained]
    P -->|No| R[Missing grounding - Add RAG or tools]
```

===== Fixes =====

==== Fix 1: RAG Grounding ====

Anchor agent responses in retrieved documents. This is the single most effective mitigation.

```python
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Ground every answer in retrieved documents
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma(persist_directory="./db", embedding_function=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

llm = ChatOpenAI(model="gpt-4o", temperature=0.1)  # Low temp reduces hallucination

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True,  # Always return sources for verification
    chain_type_kwargs={
        "prompt": PromptTemplate(
            template="""Answer based ONLY on the following context.
If the context doesn't contain the answer, say "I don't have enough information."

Context: {context}

Question: {question}

Answer:""",
            input_variables=["context", "question"],
        )
    },
)
```

==== Fix 2: Chain-of-Verification (CoVe) ====

The model drafts a response, generates verification questions, answers them independently, then produces a final verified response. Published at ACL 2024 by Meta AI and ETH Zurich (Dhuliawala et al.).(([[https://aclanthology.org/2024.findings-acl.212.pdf|Dhuliawala et al., "Chain-of-Verification Reduces Hallucination in Large Language Models," ACL Findings 2024]]))

```python
def chain_of_verification(query: str, initial_answer: str, client) -> str:
    """Implement Chain-of-Verification to reduce hallucination."""
    # Step 1: Generate verification questions
    verification_prompt = f"""Given this answer to the question "{query}":

Answer: {initial_answer}

Generate 3-5 specific factual claims that can be independently verified.
Format each as a yes/no verification question."""
    verification_resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.0,
        messages=[{"role": "user", "content": verification_prompt}],
    )
    questions = verification_resp.choices[0].message.content

    # Step 2: Answer each verification question independently
    verify_prompt = f"""Answer each question independently with YES, NO, or UNCERTAIN.
Do NOT refer to any previous answer. Use only your knowledge.

{questions}"""
    verify_resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.0,
        messages=[{"role": "user", "content": verify_prompt}],
    )
    verifications = verify_resp.choices[0].message.content

    # Step 3: Generate corrected final answer
    final_prompt = f"""Original question: {query}

Draft answer: {initial_answer}

Verification results: {verifications}

Produce a corrected final answer. Remove any claims that failed verification.
If uncertain, state what is uncertain."""
    final_resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.0,
        messages=[{"role": "user", "content": final_prompt}],
    )
    return final_resp.choices[0].message.content
```

==== Fix 3: Self-Consistency (Sample and Vote) ====

Generate multiple responses and select the majority answer. Effective for reasoning tasks.

```python
def self_consistency_check(query: str, client, n_samples: int = 5) -> str:
    """Generate multiple answers and return the most consistent one."""
    answers = []
    for _ in range(n_samples):
        resp = client.chat.completions.create(
            model="gpt-4o",
            temperature=0.7,  # Need variance for diversity
            messages=[{"role": "user", "content": query}],
        )
        answers.append(resp.choices[0].message.content)

    # Use an LLM to cluster similar answers and pick the majority
    numbered = "\n".join(f"{i + 1}. {a}" for i, a in enumerate(answers))
    cluster_prompt = f"""Given these {n_samples} answers to "{query}":

{numbered}

Group similar answers. Return the answer that appears most frequently.
If answers disagree on facts, flag the disagreement."""
    result = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.0,
        messages=[{"role": "user", "content": cluster_prompt}],
    )
    return result.choices[0].message.content
```

==== Fix 4: Temperature Tuning ====

Use a lower temperature (0.0-0.3) for factual tasks; higher temperature increases hallucination risk.

  * **Factual Q&A:** temperature=0.0 to 0.1
  * **Structured output:** temperature=0.0
  * **Creative writing:** temperature=0.7 to 1.0 (hallucination acceptable)

==== Fix 5: Constrained Decoding ====

Restrict output to valid tokens using JSON schemas, regex patterns, or grammar constraints.
```python
from pydantic import BaseModel
from openai import OpenAI


class VerifiedAnswer(BaseModel):
    answer: str
    confidence: float  # 0.0 to 1.0
    sources: list[str]
    caveats: list[str]


client = OpenAI()
response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is the population of Tokyo?"}],
    response_format=VerifiedAnswer,
    temperature=0.0,
)
# The model is forced to populate the confidence and caveats fields;
# a low confidence value flags a likely hallucination.
```

===== Hallucination Detection Code =====

```python
import numpy as np
from sentence_transformers import SentenceTransformer


class HallucinationDetector:
    """Detect potential hallucination by comparing agent output
    against source documents."""

    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)
        self.threshold = 0.3  # Below this = likely hallucination

    def check(self, agent_output: str, source_documents: list[str]) -> dict:
        """Compare agent output sentences against source docs."""
        # Split output into individual claims
        claims = [s.strip() for s in agent_output.split(".") if len(s.strip()) > 10]
        source_text = " ".join(source_documents)
        source_embedding = self.model.encode([source_text])

        results = []
        for claim in claims:
            claim_embedding = self.model.encode([claim])
            # Cosine similarity between the claim and the source text
            similarity = np.dot(claim_embedding[0], source_embedding[0]) / (
                np.linalg.norm(claim_embedding[0]) * np.linalg.norm(source_embedding[0])
            )
            results.append({
                "claim": claim,
                "similarity": float(similarity),
                "likely_hallucinated": similarity < self.threshold,
            })

        hallucinated = [r for r in results if r["likely_hallucinated"]]
        return {
            "total_claims": len(results),
            "hallucinated_claims": len(hallucinated),
            "hallucination_rate": len(hallucinated) / max(len(results), 1),
            "details": results,
        }


# Usage
detector = HallucinationDetector()
result = detector.check(
    agent_output="Tokyo has a population of 14 million. It was founded in 1457.",
    source_documents=["Tokyo, population 13.96 million, is the capital of Japan."],
)
print(f"Hallucination rate: {result['hallucination_rate']:.0%}")
```

===== See Also =====

  * [[why_is_my_rag_returning_bad_results|Why Is My RAG Returning Bad Results?]]
  * [[common_agent_failure_modes|Common Agent Failure Modes]]
  * [[how_to_handle_rate_limits|How to Handle Rate Limits]]

===== References =====