====== LLM Hallucination ======

**LLM Hallucination** refers to the phenomenon where large language models generate content that is plausible-sounding but factually incorrect, internally inconsistent, or unfaithful to the provided context. The comprehensive survey by Huang et al. (2023) establishes a systematic taxonomy of hallucination types, causes, detection methods, and mitigation strategies across the full LLM development lifecycle.

A typical claim-verification pipeline:

<code>
graph TD
    A[LLM Output] --> B[Claim Extraction]
    B --> C[Evidence Retrieval]
    C --> D[Consistency Check]
    D --> E{Verdict}
    E -->|Consistent| F[Supported]
    E -->|Inconsistent| G[Hallucinated]
</code>

===== Overview =====

As LLMs are deployed in high-stakes applications (medicine, law, finance), hallucination represents a critical reliability challenge. Unlike traditional NLP errors, which are often obviously wrong, LLM hallucinations are fluent and confident, making them particularly dangerous. The survey provides a unified framework for understanding and addressing this problem.

===== Taxonomy of Hallucination Types =====

The survey distinguishes two primary categories:

  - **Factuality Hallucination**: The model generates content that contradicts established real-world facts. This includes:
    * **Factual contradictions**: statements that are verifiably false (e.g., incorrect dates, wrong attributions)
    * **Factual fabrications**: invented entities, events, or relationships that do not exist
  - **Faithfulness Hallucination**: The model's output is inconsistent with the provided input context, instructions, or its own prior statements. This includes:
    * **Input divergence**: summaries or translations that add or omit information
    * **Context inconsistency**: contradictions within a single response or conversation
    * **Instruction non-compliance**: outputs that ignore or violate explicit constraints

These can further be classified as **intrinsic** (contradictions with the training data) or **extrinsic** (contradictions with external facts not present in the training data).
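The verification flow in the diagram above can be sketched end-to-end. Everything in this sketch is an illustrative assumption rather than a component from the survey: the sentence-level claim splitter, the two-entry toy knowledge base, the word-overlap retrieval, and the word-containment consistency check are crude stand-ins for real claim extraction, a retrieval index, and an NLI model.

```python
import re

# Toy knowledge base standing in for a real retrieval corpus (assumption).
KNOWLEDGE_BASE = [
    "The Eiffel Tower is located in Paris.",
    "Water boils at 100 degrees Celsius at sea level.",
]

def _words(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def extract_claims(output: str) -> list[str]:
    """Claim extraction: naive sentence-level split on periods."""
    return [s.strip() for s in output.split(".") if s.strip()]

def retrieve_evidence(claim: str, kb: list[str]) -> list[str]:
    """Evidence retrieval: keep KB entries sharing >= 2 words with the claim."""
    return [doc for doc in kb if len(_words(claim) & _words(doc)) >= 2]

def verdict(claim: str, kb: list[str]) -> str:
    """Consistency check: a claim is Supported only if every one of its
    words appears in some retrieved document; otherwise Hallucinated."""
    for doc in retrieve_evidence(claim, kb):
        if _words(claim) <= _words(doc):
            return "Supported"
    return "Hallucinated"

output = "The Eiffel Tower is located in Paris. The Eiffel Tower was built in 1999."
results = {c: verdict(c, KNOWLEDGE_BASE) for c in extract_claims(output)}
# The first claim matches the knowledge base; the fabricated date does not.
```

In practice, each stage would be replaced by a stronger component: an LLM or parser for claim extraction, dense retrieval over a real corpus, and an entailment model for the consistency check.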
===== Causes of Hallucination =====

The survey identifies causes across three levels of the LLM development cycle:

=== Data-Level Causes ===

  * Deficiencies in knowledge memorization -- models fail to reliably store facts from the training data
  * Poor recall of "torso and tail" facts -- less common knowledge is disproportionately hallucinated
  * Vague knowledge boundaries -- the model cannot distinguish what it knows from what it does not
  * Training data quality issues -- noise, contradictions, and outdated information in corpora

=== Training-Level Causes ===

  * **Exposure bias**: during training, models condition on ground-truth tokens; during inference, they condition on their own (potentially erroneous) outputs, so errors accumulate
  * **RLHF complications**: reinforcement learning from human feedback can inadvertently reward fluent but unfaithful outputs, as human raters sometimes prefer confident-sounding text
  * Black-box optimization dynamics that obscure how training produces hallucination-prone representations

=== Inference-Level Causes ===

  * **Attention dilution**: as sequence length increases, soft attention spreads across more positions, degrading recall of specific facts
  * **Probabilistic generation**: sampling-based decoding prioritizes fluency and coherence over factual accuracy
  * **Reasoning failures**: both short-range and long-range dependency errors in multi-step reasoning

===== Formal Characterization =====

Hallucination can be formally characterized as a divergence between generated text $y$ and a reference knowledge set $K$: the hallucinated claims are those absent from $K$ or failing verification against it:

$$\text{Hallucination}(y) = \{c \in \text{Claims}(y) : c \notin K \lor \neg\text{Verify}(c, K)\}$$

For faithfulness, the reference is the input context $x$, and the score is the fraction of claims entailed by that context (1 means fully faithful):

$$\text{Faithfulness}(y, x) = \frac{|\{c \in \text{Claims}(y) : \text{Entailed}(c, x)\}|}{|\text{Claims}(y)|}$$
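The faithfulness score can be computed directly once a claim list and an entailment judgment are available. In this minimal sketch the entailment check is a case-insensitive verbatim-substring test, a deliberately crude stand-in (an assumption, not the survey's method) for an NLI model:

```python
def faithfulness(claims: list[str], context: str, entailed) -> float:
    """Fraction of claims entailed by the context (1.0 = fully faithful)."""
    if not claims:
        return 1.0  # no claims, nothing can diverge from the context
    supported = sum(1 for c in claims if entailed(c, context))
    return supported / len(claims)

def naive_entailed(claim: str, context: str) -> bool:
    """Toy entailment judge: claim appears verbatim in the context."""
    return claim.lower() in context.lower()

context = "Marie Curie won two Nobel Prizes. She was born in Warsaw."
claims = ["Marie Curie won two Nobel Prizes", "She was born in Paris"]
score = faithfulness(claims, context, naive_entailed)  # 1 of 2 entailed -> 0.5
```

Passing the entailment judge as a parameter makes it easy to swap the substring test for a real NLI classifier without changing the scoring logic.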
===== Detection Methods =====

=== Retrieval-Based Detection ===

External knowledge sources are used to verify model outputs against factual databases: the model's claims are extracted, relevant documents are retrieved, and each claim is checked for support. Limitations include incomplete knowledge bases and retrieval errors.

=== Model-Based Detection ===

  * **Supervised approaches**: trained classifiers that use attention maps and token-probability features to predict claim-level hallucination
  * **Spectral methods**: HalluShift measures distribution shifts in hidden states (AUROC 89.9% on TruthfulQA); LapEigvals models attention as a graph Laplacian (AUROC 88.9% on TriviaQA)
  * **Internal-state methods**: PRISM uses prompt-guided hidden states as features for detection, with strong cross-domain generalization
  * **Self-consistency**: generate multiple responses and flag claims that appear inconsistently across samples

=== Human Evaluation ===

Expert annotation remains the gold standard but is expensive and does not scale. It is used primarily for benchmark creation and method validation.

===== Code Example =====

<code python>
import openai

def detect_hallucination_self_consistency(query, client, n_samples=5):
    """Detect potential hallucinations via self-consistency checking."""
    responses = []
    for _ in range(n_samples):
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": query}],
            temperature=0.7,
        )
        responses.append(resp.choices[0].message.content)

    # Extract claims from each response
    claims_per_response = []
    for resp in responses:
        extract_prompt = (
            f"Extract all factual claims from this text as a numbered list:\n{resp}"
        )
        claims_raw = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": extract_prompt}],
            temperature=0,
        ).choices[0].message.content
        claims_per_response.append(claims_raw)

    # Check consistency across samples
    check_prompt = (
        "Given these claim sets from multiple responses to the same query, "
        "identify claims that appear inconsistently (potential hallucinations):\n\n"
        + "\n---\n".join(claims_per_response)
    )
    inconsistencies = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": check_prompt}],
        temperature=0,
    ).choices[0].message.content
    return {"responses": responses, "inconsistencies": inconsistencies}
</code>

===== Mitigation Strategies =====

=== Reinforcement Learning from Human Feedback (RLHF) ===

Training reward models to penalize hallucinated content. Effective, but it can introduce its own biases -- models may learn to hedge rather than to be accurate.

=== Retrieval-Augmented Generation (RAG) ===

Grounding model outputs in retrieved documents reduces factual hallucination by providing explicit evidence. However, RAG alone is insufficient when retrieval quality is poor or the question requires reasoning beyond the retrieved facts.

=== Self-Consistency and Verification ===

Generating multiple candidate responses and selecting the most consistent answer. Related techniques include Chain-of-Verification (CoVe) and self-reflection prompting.

=== Decoding Strategies ===

  * An over-confidence penalty during generation, to keep the model from committing too strongly to uncertain claims
  * Retrospective allocation strategies that redistribute probability mass
  * Constrained decoding that enforces factual grounding

=== Citation Mechanisms ===

Requiring models to cite sources for their claims, analogous to web-search attribution. This makes hallucinations easier to detect and provides an accountability mechanism.

===== Key Benchmarks =====

^ Benchmark ^ Type ^ Description ^
| TruthfulQA | Factuality | Tests whether models produce truthful answers to adversarial questions |
| HaluEval | Discrimination | Requires models to identify whether statements contain hallucinations |
| FACTOR | Likelihood | Tests whether models assign higher probability to factual vs. non-factual statements |
| FActScore | Atomic facts | Decomposes generations into atomic facts and verifies each against Wikipedia |

===== References =====

  * [[https://arxiv.org/abs/2311.05232|Huang et al., "A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions", arXiv:2311.05232 (2023)]]
  * [[https://arxiv.org/abs/2309.01219|Ji et al., "Survey of Hallucination in Natural Language Generation", ACM Computing Surveys (2023)]]

===== See Also =====

  * [[chain_of_verification|Chain-of-Verification (CoVe)]]
  * [[step_back_prompting|Step-Back Prompting]]
  * [[tool_learning_foundation_models|Tool Learning with Foundation Models]]