A practical troubleshooting guide for diagnosing and fixing hallucination in LLM-based agents. Hallucination occurs when an agent generates plausible but factually incorrect outputs — from wrong dates and fake citations to invented API behaviors.
Unlike simple LLM hallucination, agent hallucination compounds across tool calls, planning steps, and multi-turn interactions. A 2024 study from the Chinese Academy of Sciences cataloged agent-specific hallucination taxonomies, finding that agents suffer from unique failure modes beyond base model confabulation.1)
Key statistics:
Agents parse tool outputs incorrectly, fabricating details from ambiguous or noisy data. A Stanford study on legal RAG tools found agents frequently hallucinate by being unfaithful to retrieved data.3)
Symptoms: Agent cites specific numbers or facts that don't appear in tool output. Confident answers that contradict the data returned.
When conversation history, tool results, and instructions exceed the token limit, critical information gets truncated silently.
Symptoms: Agent “forgets” earlier instructions. Answers become increasingly incoherent in long sessions. Tool results from early in the conversation are ignored.
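One way to catch this failure early is to track token usage explicitly instead of trusting the runtime. The sketch below is illustrative only: the 4-characters-per-token heuristic is rough (use a real tokenizer such as tiktoken in practice), and the 128k limit / 4k reserve figures are assumptions, not properties of any particular model.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def check_context_budget(messages: list[dict], limit: int = 128_000,
                         reserve: int = 4_096) -> dict:
    """Flag conversations about to overflow the context window,
    before the runtime truncates them silently."""
    used = sum(estimate_tokens(m["content"]) for m in messages)
    budget = limit - reserve  # keep room for the model's response
    return {
        "tokens_used": used,
        "budget": budget,
        "over_budget": used > budget,
        "utilization": round(used / budget, 3),
    }
```

Running this check before each model call turns silent truncation into an explicit signal you can act on, e.g. by summarizing or evicting old turns.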
Vague prompts like “find recent breakthroughs” invite the model to fill gaps with fabricated content.
Symptoms: Agent invents specific dates, names, or URLs. Responses contain plausible-sounding but unverifiable claims.
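A cheap mitigation is to rewrite vague requests into prompts that demand attribution and make refusal explicit. A minimal sketch; the wording is an illustrative template, not a tested or recommended one:

```python
def constrain_prompt(topic: str, cutoff_date: str) -> str:
    """Turn a vague research request into one that demands attribution
    and gives the model an explicit alternative to guessing."""
    return (
        f"List developments in {topic} that you can attribute to a named, "
        f"dated source published before {cutoff_date}. For each item, give "
        "the source name and publication date. If you cannot name a source, "
        "write 'unverified' for that item instead of guessing. "
        "Do not invent URLs, dates, or author names."
    )
```

Giving the model an approved way to express uncertainty ("unverified") matters as much as demanding sources: without it, the path of least resistance is fabrication.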
Without external verification, agents rely purely on parametric knowledge, which is probabilistic by nature.
Symptoms: Answers sound authoritative but contain subtle errors. Model never says “I don't know.”
Autoregressive generation means early errors cascade — each wrong token increases the probability of subsequent wrong tokens.4)
Symptoms: Responses start correctly but drift into fabrication. Longer outputs are less accurate than shorter ones.
High temperature or top-p settings increase sampling randomness, making hallucination more likely. When the output distribution has multiple competing peaks, softmax overconfidence compounds the problem.
Anchor agent responses in retrieved documents. This is the single most effective mitigation.
```python
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Ground every answer in retrieved documents
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma(persist_directory="./db", embedding_function=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

llm = ChatOpenAI(model="gpt-4o", temperature=0.1)  # Low temp reduces hallucination

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True,  # Always return sources for verification
    chain_type_kwargs={
        "prompt": PromptTemplate(
            template="""Answer based ONLY on the following context.
If the context doesn't contain the answer, say "I don't have enough information."

Context: {context}

Question: {question}

Answer:""",
            input_variables=["context", "question"],
        )
    },
)
```
In Chain-of-Verification (CoVe), the model drafts a response, generates verification questions, answers them independently, then produces a final verified response. Published at ACL 2024 by Meta AI and ETH Zurich (Dhuliawala et al.).5)
```python
def chain_of_verification(query: str, initial_answer: str, client) -> str:
    """Implement Chain-of-Verification to reduce hallucination."""
    # Step 1: Generate verification questions
    verification_prompt = f"""Given this answer to the question "{query}":

Answer: {initial_answer}

Generate 3-5 specific factual claims that can be independently verified.
Format each as a yes/no verification question."""
    verification_resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.0,
        messages=[{"role": "user", "content": verification_prompt}],
    )
    questions = verification_resp.choices[0].message.content

    # Step 2: Answer each verification question independently
    verify_prompt = f"""Answer each question independently with YES, NO, or UNCERTAIN.
Do NOT refer to any previous answer. Use only your knowledge.

{questions}"""
    verify_resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.0,
        messages=[{"role": "user", "content": verify_prompt}],
    )
    verifications = verify_resp.choices[0].message.content

    # Step 3: Generate corrected final answer
    final_prompt = f"""Original question: {query}

Draft answer: {initial_answer}

Verification results: {verifications}

Produce a corrected final answer. Remove any claims that failed verification.
If uncertain, state what is uncertain."""
    final_resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.0,
        messages=[{"role": "user", "content": final_prompt}],
    )
    return final_resp.choices[0].message.content
```
Self-consistency: generate multiple responses and select the majority answer. Effective for reasoning tasks.
```python
def self_consistency_check(query: str, client, n_samples: int = 5) -> str:
    """Generate multiple answers and return the most consistent one."""
    answers = []
    for _ in range(n_samples):
        resp = client.chat.completions.create(
            model="gpt-4o",
            temperature=0.7,  # Need variance for diversity
            messages=[{"role": "user", "content": query}],
        )
        answers.append(resp.choices[0].message.content)

    # Use LLM to cluster similar answers and pick majority
    cluster_prompt = f"""Given these {n_samples} answers to "{query}":

{chr(10).join(f'{i+1}. {a}' for i, a in enumerate(answers))}

Group similar answers. Return the answer that appears most frequently.
If answers disagree on facts, flag the disagreement."""
    result = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.0,
        messages=[{"role": "user", "content": cluster_prompt}],
    )
    return result.choices[0].message.content
```
Use a lower temperature (0.0-0.3) for factual tasks; higher temperatures increase hallucination risk.
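This guideline can be encoded as per-task sampling presets. A small sketch; the exact values are illustrative choices within the 0.0-0.3 factual range, not vendor recommendations:

```python
def decoding_params(task: str) -> dict:
    """Map a task type to sampling settings; unknown tasks fall back
    to the conservative factual preset."""
    presets = {
        "factual": {"temperature": 0.0, "top_p": 1.0},     # deterministic
        "extraction": {"temperature": 0.2, "top_p": 0.9},  # near-deterministic
        "creative": {"temperature": 0.8, "top_p": 0.95},   # variance acceptable
    }
    return presets.get(task, presets["factual"])
```

The result can be passed as keyword arguments to a completion call, e.g. `client.chat.completions.create(model="gpt-4o", **decoding_params("factual"), messages=...)`.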
Restrict output to valid tokens using JSON schemas, regex patterns, or grammar constraints.
```python
from pydantic import BaseModel
from openai import OpenAI

class VerifiedAnswer(BaseModel):
    answer: str
    confidence: float  # 0.0 to 1.0
    sources: list[str]
    caveats: list[str]

client = OpenAI()
response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is the population of Tokyo?"}],
    response_format=VerifiedAnswer,
    temperature=0.0,
)
# Model is forced to populate confidence and caveats fields
# Low confidence flags likely hallucination
```
A lightweight embedding-based detector can flag output claims that lack support in the source documents:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

class HallucinationDetector:
    """Detect potential hallucination by comparing agent output
    against source documents."""

    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)
        self.threshold = 0.3  # Below this = likely hallucination

    def check(self, agent_output: str, source_documents: list[str]) -> dict:
        """Compare agent output sentences against source docs."""
        # Split output into individual claims
        claims = [s.strip() for s in agent_output.split('.') if len(s.strip()) > 10]
        source_text = " ".join(source_documents)
        source_embedding = self.model.encode([source_text])

        results = []
        for claim in claims:
            claim_embedding = self.model.encode([claim])
            similarity = np.dot(claim_embedding[0], source_embedding[0]) / (
                np.linalg.norm(claim_embedding[0]) * np.linalg.norm(source_embedding[0])
            )
            results.append({
                "claim": claim,
                "similarity": float(similarity),
                "likely_hallucinated": similarity < self.threshold,
            })

        hallucinated = [r for r in results if r["likely_hallucinated"]]
        return {
            "total_claims": len(results),
            "hallucinated_claims": len(hallucinated),
            "hallucination_rate": len(hallucinated) / max(len(results), 1),
            "details": results,
        }

# Usage
detector = HallucinationDetector()
result = detector.check(
    agent_output="Tokyo has a population of 14 million. It was founded in 1457.",
    source_documents=["Tokyo, population 13.96 million, is the capital of Japan."],
)
print(f"Hallucination rate: {result['hallucination_rate']:.0%}")
```