LLM Hallucination

LLM Hallucination refers to the phenomenon where large language models generate content that is plausible-sounding but factually incorrect, internally inconsistent, or unfaithful to provided context. The comprehensive survey by Huang et al. (2023) establishes a systematic taxonomy of hallucination types, causes, detection methods, and mitigation strategies across the full LLM development lifecycle.

graph TD
    A[LLM Output] --> B[Claim Extraction]
    B --> C[Evidence Retrieval]
    C --> D[Consistency Check]
    D --> E{Verdict}
    E -->|Consistent| F[Supported]
    E -->|Inconsistent| G[Hallucinated]

Overview

As LLMs are deployed in high-stakes applications (medicine, law, finance), hallucination represents a critical reliability challenge. Unlike traditional NLP errors that are often obviously wrong, LLM hallucinations are fluent and confident, making them particularly dangerous. The survey provides a unified framework for understanding and addressing this problem.

Taxonomy of Hallucination Types

The survey distinguishes two primary categories:

  1. Factuality Hallucination: The model generates content that contradicts established real-world facts. This includes:
    • Factual contradictions: Statements that are verifiably false (e.g., incorrect dates, wrong attributions)
    • Factual fabrications: Invented entities, events, or relationships that do not exist
  2. Faithfulness Hallucination: The model's output is inconsistent with the provided input context, instructions, or its own prior statements. This includes:
    • Input divergence: Summaries or translations that add or omit information
    • Context inconsistency: Contradictions within a single response or conversation
    • Instruction non-compliance: Outputs that ignore or violate explicit constraints

These can further be classified as intrinsic (the output directly contradicts the source or input content) or extrinsic (the output cannot be verified against the source, though it may or may not be false).
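The taxonomy above can be encoded as a small data structure. This is an illustrative sketch (the enum and `AnnotatedClaim` names are not from the survey) mapping each subtype to its primary category:

```python
from dataclasses import dataclass
from enum import Enum, auto

class HallucinationType(Enum):
    """Subtypes from the survey's two-level taxonomy."""
    FACTUAL_CONTRADICTION = auto()      # verifiably false statement
    FACTUAL_FABRICATION = auto()        # invented entity, event, or relationship
    INPUT_DIVERGENCE = auto()           # adds/omits info relative to the input
    CONTEXT_INCONSISTENCY = auto()      # self-contradiction within a response
    INSTRUCTION_NONCOMPLIANCE = auto()  # violates explicit constraints

FACTUALITY = {HallucinationType.FACTUAL_CONTRADICTION,
              HallucinationType.FACTUAL_FABRICATION}

@dataclass
class AnnotatedClaim:
    text: str
    label: HallucinationType

    @property
    def category(self) -> str:
        # Everything outside the factuality subtypes is a faithfulness issue.
        return "factuality" if self.label in FACTUALITY else "faithfulness"
```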

Causes of Hallucination

The survey identifies causes across three levels of the LLM development cycle:

Data-Level Causes

Flawed pre-training data is a primary driver: misinformation and biases in web-scraped corpora, knowledge boundaries (missing or outdated facts), and duplicated text that the model memorizes and over-generalizes.

Training-Level Causes

The next-token prediction objective rewards fluency rather than factual grounding, and exposure bias arises because models are trained on ground-truth prefixes but must generate from their own imperfect outputs. Alignment training can also induce sycophancy, where the model defers to a user's false premise.

Inference-Level Causes

Stochastic decoding (high temperature, top-k/top-p sampling) trades factual precision for diversity, and attention to the original context can decay over long generations, letting the output drift from the input.

Formal Characterization

Hallucination can be formally characterized as a divergence between generated text $y$ and a reference knowledge set $K$:

$$\text{Hallucination}(y) = \{c \in \text{Claims}(y) : c \notin K \lor \neg\text{Verify}(c, K)\}$$

For faithfulness, the reference is the input context $x$:

$$\text{Faithfulness}(y, x) = \frac{|\{c \in \text{Claims}(y) : \text{Entailed}(c, x)\}|}{|\text{Claims}(y)|}$$
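These quantities can be sketched in code: faithfulness is the fraction of extracted claims entailed by the context, and a claim counts as hallucinated when it is absent from the knowledge set or fails verification. Claim extraction and the `verify`/`entailed` predicates (in practice an NLI model or fact-checker) are assumed to be supplied by the caller:

```python
def hallucinated_claims(claims, knowledge, verify):
    """Claims absent from the knowledge set, or present but failing verification."""
    return {c for c in claims if c not in knowledge or not verify(c, knowledge)}

def faithfulness(claims, context, entailed):
    """Fraction of extracted claims entailed by the input context."""
    if not claims:
        return 1.0  # vacuously faithful: nothing asserted, nothing to contradict
    return sum(entailed(c, context) for c in claims) / len(claims)
```

With a toy substring check standing in for an entailment model, `faithfulness(["a", "z"], "ab", lambda c, x: c in x)` returns 0.5: one of two claims is supported by the context.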

Detection Methods

Retrieval-Based Detection

External knowledge sources verify model outputs against factual databases. The model's claims are extracted, relevant documents retrieved, and each claim checked for support. Limitations include incomplete knowledge bases and retrieval errors.
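A minimal sketch of that pipeline, with `retrieve` and `supports` left as pluggable callables (in practice a search index and an entailment model); the function and verdict names are illustrative:

```python
def verify_against_retrieval(claims, retrieve, supports):
    """Check each extracted claim against retrieved evidence.

    retrieve(claim)  -> list of evidence passages
    supports(claim, passage) -> bool, e.g. an NLI entailment model
    """
    verdicts = {}
    for claim in claims:
        passages = retrieve(claim)
        if not passages:
            # Knowledge-base incompleteness: absence of evidence is not proof of error.
            verdicts[claim] = "no-evidence"
        elif any(supports(claim, p) for p in passages):
            verdicts[claim] = "supported"
        else:
            verdicts[claim] = "unsupported"  # candidate hallucination
    return verdicts
```

The three-way verdict matters: conflating "no-evidence" with "unsupported" would blame the model for gaps in the knowledge base.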

Model-Based Detection

Uses the model's own signals rather than external knowledge: low token probabilities and high predictive entropy correlate with hallucination, and sampling-based methods such as SelfCheckGPT flag statements that are not consistently reproduced across independently sampled responses.
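A sampling-consistency score in the style of SelfCheckGPT can be sketched as follows; the `supports` predicate stands in for an NLI model scoring whether a sampled response backs the sentence, and the function name is illustrative:

```python
def selfcheck_score(sentence, samples, supports):
    """Fraction of independently sampled responses that do NOT support
    the sentence. Intuition: hallucinated content is unstable across
    samples, so a high score flags a likely hallucination."""
    if not samples:
        return 0.0
    unsupported = sum(not supports(sentence, s) for s in samples)
    return unsupported / len(samples)
```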

Human Evaluation

Expert annotation remains the gold standard but is expensive and not scalable. Used primarily for benchmark creation and method validation.

Code Example

def detect_hallucination_self_consistency(query, client, n_samples=5):
    """Detect potential hallucinations by sampling multiple responses and
    flagging claims that do not recur consistently across samples.

    client: an OpenAI-compatible client, e.g. openai.OpenAI().
    """
    responses = []
    for _ in range(n_samples):
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": query}],
            temperature=0.7
        )
        responses.append(resp.choices[0].message.content)
 
    # Extract claims from each response
    claims_per_response = []
    for resp in responses:
        extract_prompt = (
            f"Extract all factual claims from this text as a numbered list:\n{resp}"
        )
        claims_raw = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": extract_prompt}],
            temperature=0
        ).choices[0].message.content
        claims_per_response.append(claims_raw)
 
    # Check consistency across samples
    check_prompt = (
        "Given these claim sets from multiple responses to the same query, "
        "identify claims that appear inconsistently (potential hallucinations):\n\n"
        + "\n---\n".join(claims_per_response)
    )
    inconsistencies = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": check_prompt}],
        temperature=0
    ).choices[0].message.content
 
    return {"responses": responses, "inconsistencies": inconsistencies}

Mitigation Strategies

Reinforcement Learning from Human Feedback (RLHF)

Training reward models to penalize hallucinated content. Effective but can introduce its own biases – models may learn to hedge rather than be accurate.

Retrieval-Augmented Generation (RAG)

Grounding model outputs in retrieved documents reduces factual hallucination by providing explicit evidence. However, RAG alone is insufficient when retrieval quality is poor or the question requires reasoning beyond retrieved facts.
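A minimal RAG sketch, assuming `retrieve` and `generate` callables (the names are illustrative); the prompt explicitly instructs the model to abstain when the evidence is insufficient rather than guess:

```python
def rag_answer(query, retrieve, generate, k=3):
    """Ground generation in the top-k retrieved passages."""
    passages = retrieve(query)[:k]
    # Number the passages so the model can cite them by index.
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Answer using ONLY the passages below and cite passage numbers. "
        "If the passages do not contain the answer, say so.\n\n"
        f"{context}\n\nQuestion: {query}\nAnswer:"
    )
    return generate(prompt)
```

The abstention instruction is the key anti-hallucination lever here: without it, a model facing poor retrieval tends to fall back on parametric memory and fabricate.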

Self-Consistency and Verification

Generating multiple candidate responses and selecting the most consistent answer. Related techniques include Chain-of-Verification (CoVe) and self-reflection prompting.

Decoding Strategies

Factuality-oriented decoding intervenes at generation time: lowering temperature or tightening nucleus (top-p) sampling reduces the chance of emitting low-probability tokens, while contrastive methods such as DoLa compare predictions across transformer layers to surface factual knowledge.
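The effect of temperature can be seen on a toy next-token distribution: dividing the logits by T < 1 before the softmax concentrates probability mass on the highest-likelihood token, lowering the chance of sampling an unlikely (and potentially hallucinated) continuation:

```python
import math

def apply_temperature(logits, temperature):
    """Softmax with temperature: T < 1 sharpens the distribution,
    T > 1 flattens it toward uniform."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                              # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [3.0, 2.0, 1.0]                  # toy next-token logits
sharp = apply_temperature(logits, 0.5)    # conservative decoding
flat = apply_temperature(logits, 1.5)     # diverse decoding
assert sharp[0] > flat[0]                 # top token gains mass at low T
```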

Citation Mechanisms

Requiring models to cite sources for claims, analogous to web search attribution. This makes hallucinations easier to detect and provides an accountability mechanism.

Key Benchmarks

| Benchmark | Type | Description |
| --- | --- | --- |
| TruthfulQA | Factuality | Tests whether models produce truthful answers to adversarial questions |
| HaluEval | Discrimination | Requires models to identify whether statements contain hallucinations |
| FACTOR | Likelihood | Tests whether models assign higher probability to factual vs. non-factual statements |
| FActScore | Atomic facts | Decomposes generations into atomic facts and verifies each against Wikipedia |

References

Huang, L., et al. (2023). "A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions." arXiv:2311.05232.
