Core Concepts
Reasoning
Memory & Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools
Safety & Security
Evaluation
Meta
LLM Hallucination refers to the phenomenon where large language models generate content that is plausible-sounding but factually incorrect, internally inconsistent, or unfaithful to provided context. The comprehensive survey by Huang et al. (2023) establishes a systematic taxonomy of hallucination types, causes, detection methods, and mitigation strategies across the full LLM development lifecycle.
As LLMs are deployed in high-stakes applications (medicine, law, finance), hallucination represents a critical reliability challenge. Unlike traditional NLP errors that are often obviously wrong, LLM hallucinations are fluent and confident, making them particularly dangerous. The survey provides a unified framework for understanding and addressing this problem.
The survey distinguishes two primary categories:

- **Factuality hallucination**: output that contradicts, or cannot be verified against, real-world facts.
- **Faithfulness hallucination**: output that is inconsistent with the user's instructions or the provided input context.

These can be further classified as intrinsic (contradicting knowledge the model was exposed to) or extrinsic (claims about facts that cannot be verified from that knowledge at all).
The survey identifies causes across three levels of the LLM development cycle: data-related causes (flawed, biased, or outdated training corpora and knowledge gaps), training-related causes (architectural limitations, exposure bias, and misaligned training objectives), and inference-related causes (imperfect decoding strategies such as sampling randomness).
Hallucination can be formally characterized as a divergence between generated text $y$ and a reference knowledge set $K$:
$$\text{Hallucination}(y) = \{c \in \text{Claims}(y) : c \notin K \lor \neg\text{Verify}(c, K)\}$$
For faithfulness, the reference is the input context $x$:
$$\text{Faithfulness}(y, x) = \frac{|\{c \in \text{Claims}(y) : \text{Entailed}(c, x)\}|}{|\text{Claims}(y)|}$$
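In code, the faithfulness ratio reduces to counting entailed claims. The sketch below assumes claims have already been extracted, and substitutes a deliberately naive substring check for a real entailment model; `faithfulness` and `naive_entails` are illustrative names, not from the survey.

```python
def faithfulness(claims, context, entails):
    """Fraction of extracted claims entailed by the input context.

    `claims` is a list of claim strings, `context` the source text, and
    `entails(claim, context)` any entailment predicate (in practice an
    NLI model; here it can be any callable).
    """
    if not claims:
        return 1.0  # vacuously faithful: nothing was claimed
    supported = sum(1 for c in claims if entails(c, context))
    return supported / len(claims)

# Toy entailment check: a claim is "entailed" if it appears verbatim.
naive_entails = lambda claim, ctx: claim.lower() in ctx.lower()

ctx = "Paris is the capital of France. It lies on the Seine."
claims = ["paris is the capital of france", "paris has 10 million residents"]
print(faithfulness(claims, ctx, naive_entails))  # 0.5
```

A score of 1.0 means every claim is supported; anything below flags potential faithfulness hallucination in the terms of the equation above.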
External knowledge sources verify model outputs against factual databases. The model's claims are extracted, relevant documents retrieved, and each claim checked for support. Limitations include incomplete knowledge bases and retrieval errors.
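A minimal sketch of this verification loop follows, with a word-overlap heuristic standing in for both the retriever and the support check; the function name and threshold are illustrative assumptions, and a production system would use dense retrieval plus an NLI model instead.

```python
def verify_against_kb(claims, kb_docs, overlap_threshold=0.8):
    """Check each extracted claim against reference documents.

    `kb_docs` stands in for documents returned by a retriever; a claim is
    marked supported when at least `overlap_threshold` of its words occur
    in a single document. This word-overlap check is a crude placeholder
    for a proper entailment model.
    """
    results = {}
    for claim in claims:
        words = set(claim.lower().split())
        best = max(
            len(words & set(doc.lower().split())) / len(words)
            for doc in kb_docs
        )
        results[claim] = best >= overlap_threshold
    return results

kb = ["The Eiffel Tower was completed in 1889 and stands in Paris."]
claims = [
    "the eiffel tower was completed in 1889",
    "the eiffel tower is in london",
]
print(verify_against_kb(claims, kb))
```

The limitations noted above show up directly here: a claim absent from the knowledge base is flagged as unsupported even when true.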
Expert annotation remains the gold standard but is expensive and does not scale; it is used primarily for benchmark creation and method validation.
```python
def detect_hallucination_self_consistency(query, client, n_samples=5):
    """Detect potential hallucinations via self-consistency checking.

    Samples several responses to the same query, extracts factual claims
    from each, then asks the model to flag claims that do not appear
    consistently across samples.
    """
    # Sample multiple responses at non-zero temperature.
    responses = []
    for _ in range(n_samples):
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": query}],
            temperature=0.7,
        )
        responses.append(resp.choices[0].message.content)

    # Extract claims from each response.
    claims_per_response = []
    for resp in responses:
        extract_prompt = (
            f"Extract all factual claims from this text as a numbered list:\n{resp}"
        )
        claims_raw = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": extract_prompt}],
            temperature=0,
        ).choices[0].message.content
        claims_per_response.append(claims_raw)

    # Check consistency across samples.
    check_prompt = (
        "Given these claim sets from multiple responses to the same query, "
        "identify claims that appear inconsistently (potential hallucinations):\n\n"
        + "\n---\n".join(claims_per_response)
    )
    inconsistencies = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": check_prompt}],
        temperature=0,
    ).choices[0].message.content

    return {"responses": responses, "inconsistencies": inconsistencies}
```
Training reward models to penalize hallucinated content. Effective but can introduce its own biases – models may learn to hedge rather than be accurate.
Grounding model outputs in retrieved documents reduces factual hallucination by providing explicit evidence. However, RAG alone is insufficient when retrieval quality is poor or the question requires reasoning beyond retrieved facts.
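As a rough illustration of the grounding step, the sketch below assembles a retrieval-augmented prompt: word overlap serves as a stand-in retriever, and the template includes an explicit abstention instruction. The function name and prompt wording are assumptions for this example, not from the survey.

```python
def build_grounded_prompt(question, documents, top_k=2):
    """Assemble a retrieval-augmented prompt that grounds the answer.

    Ranks documents by word overlap with the question (a placeholder for
    a real sparse/dense retriever) and instructs the model to answer only
    from the retrieved passages, abstaining otherwise.
    """
    q_words = set(question.lower().split())
    ranked = sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    context = "\n\n".join(f"[{i+1}] {d}" for i, d in enumerate(ranked[:top_k]))
    return (
        "Answer using ONLY the passages below. If they do not contain "
        'the answer, say "I don\'t know."\n\n'
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

docs = [
    "The Louvre is a museum in Paris.",
    "The Eiffel Tower was completed in 1889.",
    "Paris is the capital of France.",
]
prompt = build_grounded_prompt("When was the Eiffel Tower completed?", docs)
print(prompt)
```

The abstention instruction matters as much as the retrieved context: it gives the model a sanctioned alternative to fabricating an answer when retrieval misses.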
Generating multiple candidate responses and selecting the most consistent answer. Related techniques include Chain-of-Verification (CoVe) and self-reflection prompting.
Requiring models to cite sources for claims, analogous to web search attribution. This makes hallucinations easier to detect and provides an accountability mechanism.
| Benchmark | Type | Description |
|---|---|---|
| TruthfulQA | Factuality | Tests whether models produce truthful answers to adversarial questions |
| HaluEval | Discrimination | Requires models to identify whether statements contain hallucinations |
| FACTOR | Likelihood | Tests whether models assign higher probability to factual vs. non-factual statements |
| FActScore | Atomic facts | Decomposes generations into atomic facts and verifies each against Wikipedia |