Chain-of-Verification (CoVe) is a prompting method introduced by Dhuliawala et al. at Meta in 2023 that systematically reduces hallucinations in large language models through structured self-verification. Rather than relying on external knowledge bases, CoVe leverages the LLM's own ability to fact-check its outputs by decomposing verification into independent sub-tasks.
LLMs frequently generate plausible but factually incorrect content, a phenomenon known as hallucination. CoVe addresses this by introducing a four-stage pipeline that forces the model to critically examine its own draft response before producing a final answer. The key insight is that LLMs answer short, focused verification questions more accurately than they generate complex, multi-fact responses.
CoVe decomposes response generation into four sequential stages:

1. **Generate baseline response**: the model drafts an initial answer to the query.
2. **Plan verifications**: given the query and the draft, the model generates a list of verification questions that fact-check the draft's key claims.
3. **Execute verifications**: each verification question is answered independently, without access to the original draft.
4. **Generate final verified response**: the model revises the draft using the verification question-answer pairs, correcting any inconsistencies.
The independence constraint in Stage 3 is critical. When the model can see its original response while answering verification questions, it tends to simply confirm its earlier claims, defeating the purpose of verification.
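The contrast can be made concrete with a small sketch. The prompt templates, function names, and the example claim below are illustrative, not taken from the paper:

```python
# Illustrative only: shows why the draft must be kept out of Stage 3 prompts.

def joint_verification_prompt(draft: str, question: str) -> str:
    """Leaky prompt: the draft is visible, so the model tends to confirm it."""
    return f"Draft response: {draft}\nVerify: {question}"

def factored_verification_prompt(question: str) -> str:
    """Independent prompt: the question is asked with no access to the draft."""
    return f"Answer concisely and factually: {question}"

draft = "The Eiffel Tower was completed in 1887."   # (actually 1889)
question = "In what year was the Eiffel Tower completed?"

leaky = joint_verification_prompt(draft, question)
independent = factored_verification_prompt(question)

assert draft in leaky             # the wrong claim can bias the answer
assert draft not in independent   # the claim cannot leak into the answer
```

Because the factored prompt contains only the question, the model must answer from its parametric knowledge rather than echo the draft.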
Three variants of CoVe have been proposed:

1. **Joint**: verification questions are planned and answered within a single prompt, so the answers can still condition on the draft.
2. **2-Step**: planning and answering happen in separate prompts, keeping the draft out of the answering context.
3. **Factored**: each verification question is answered in its own prompt, independent of the draft and of every other answer.
Given a query $q$, the process can be formalized as:
$$R_0 = \text{LLM}(q)$$
$$V = \{v_1, v_2, \ldots, v_k\} = \text{Plan}(q, R_0)$$
$$a_i = \text{LLM}(v_i) \quad \forall i \in \{1, \ldots, k\}$$
$$R_{\text{final}} = \text{Revise}(q, R_0, \{(v_i, a_i)\})$$
The independence constraint requires that $a_i$ is generated without conditioning on $R_0$ or any $a_j$ where $j \neq i$.
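The four equations can be traced in a toy sketch; every function body below is a placeholder stub, not the paper's actual prompt templates:

```python
# Stub pipeline mirroring the formalization above (illustrative only).

def llm(prompt: str) -> str:
    # Stand-in for LLM(.): returns a canned answer string.
    return f"answer({prompt})"

def plan(q: str, r0: str) -> list[str]:
    # Stand-in for Plan(q, R_0): emits k verification questions.
    return [f"verify claim {i} of '{q}'" for i in (1, 2, 3)]

def revise(q: str, r0: str, qa: list[tuple[str, str]]) -> str:
    # Stand-in for Revise(q, R_0, {(v_i, a_i)}).
    return f"{r0} revised with {len(qa)} verifications"

q = "example query"
r0 = llm(q)                                   # R_0 = LLM(q)
vs = plan(q, r0)                              # V = {v_1, ..., v_k}
ans = [llm(v) for v in vs]                    # a_i = LLM(v_i): no R_0 in scope
r_final = revise(q, r0, list(zip(vs, ans)))   # R_final = Revise(q, R_0, {(v_i, a_i)})

print(r_final)  # → answer(example query) revised with 3 verifications
```

Note that the list comprehension computing `ans` touches only `vs`, which is how the independence constraint shows up structurally: `r0` is simply never passed in.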
```python
from openai import OpenAI  # `client` below is an OpenAI() instance


def chain_of_verification(query, client):
    # Stage 1: Generate baseline response
    baseline = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": query}],
    ).choices[0].message.content

    # Stage 2: Plan verification questions
    plan_prompt = (
        f"Given this query: {query}\n"
        f"And this draft response: {baseline}\n"
        "List verification questions to fact-check key claims."
    )
    questions_raw = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": plan_prompt}],
    ).choices[0].message.content
    questions = [q.strip() for q in questions_raw.strip().split("\n") if q.strip()]

    # Stage 3: Independently answer each verification question
    # (each question is sent alone, so the draft cannot bias the answer)
    answers = {}
    for vq in questions:
        ans = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": vq}],
        ).choices[0].message.content
        answers[vq] = ans

    # Stage 4: Generate final verified response
    verification_context = "\n".join(
        f"Q: {q}\nA: {a}" for q, a in answers.items()
    )
    revise_prompt = (
        f"Original query: {query}\n"
        f"Draft response: {baseline}\n"
        f"Verification results:\n{verification_context}\n"
        "Revise the draft to correct any factual errors found."
    )
    final = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": revise_prompt}],
    ).choices[0].message.content
    return final
```
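One practical wrinkle: models usually return the Stage 2 question list with `1.` or `-` prefixes, which the naive `split("\n")` keeps in each question. A small helper can strip them (`parse_questions` is a hypothetical name, not part of any library):

```python
import re

def parse_questions(raw: str) -> list[str]:
    """Strip leading enumeration markers so each line is a bare question."""
    questions = []
    for line in raw.splitlines():
        # Remove "1.", "2)", "-", "*", or "•" prefixes, then trim whitespace.
        line = re.sub(r"^\s*(?:[-*•]|\d+[.)])\s*", "", line).strip()
        if line:
            questions.append(line)
    return questions

raw = "1. When was X founded?\n2) Who invented Y?\n- Is Z a prime?"
print(parse_questions(raw))
# → ['When was X founded?', 'Who invented Y?', 'Is Z a prime?']
```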
CoVe was evaluated on hallucination-prone tasks using Meta's LLaMA models:
| Task | Metric | Baseline | CoVe (Factored) |
|---|---|---|---|
| List-based QA (Wikidata) | Precision | Low | Significant improvement |
| Closed-book MultiSpanQA | F1 | 0.39 | 0.48 (+23%) |
| Longform Generation | FactScore | 60.8 | 63.7 |
Key findings:

- The factored variant, which answers each verification question in a separate prompt, outperformed the joint variant, supporting the independence constraint.
- Models answered the short verification questions more accurately than they stated the same facts inside longform responses, confirming the method's core premise.
- Improvements held across all three task types: list-based QA, closed-book QA, and longform generation.