====== Chain-of-Verification (CoVe) ======
**Chain-of-Verification (CoVe)** is a prompting method introduced by Dhuliawala et al. at Meta in 2023 that systematically reduces hallucinations in large language models through structured self-verification. Rather than relying on external knowledge bases, CoVe leverages the LLM's own ability to fact-check its outputs by decomposing verification into independent sub-tasks.
<code mermaid>
graph TD
    A[User Query] --> B[Stage 1: Generate Draft Response]
    B --> C[Stage 2: Plan Verification Questions]
    C --> D[Stage 3: Answer Questions Independently]
    D --> E[Stage 4: Check Consistency]
    E --> F{Errors Found?}
    F -->|Yes| G[Revise Response]
    F -->|No| H[Final Verified Response]
    G --> H
</code>
===== Overview =====
LLMs frequently generate plausible but factually incorrect content -- a phenomenon known as hallucination. CoVe addresses this by introducing a four-stage pipeline that forces the model to critically examine its own draft response before producing a final answer. The key insight is that LLMs answer short, focused verification questions more accurately than they generate complex, multi-fact responses.
===== The Four-Stage Process =====
CoVe decomposes response generation into four sequential stages:
- **Stage 1 -- Baseline Response Generation**: The LLM produces an initial draft answer to the user query. This response may contain hallucinated facts due to reliance on parametric knowledge.
- **Stage 2 -- Verification Planning**: Given the query and baseline response, the LLM generates a set of targeted verification questions designed to probe key factual claims (e.g., "Was [person] really born in [city]?").
- **Stage 3 -- Independent Verification Execution**: The LLM answers each verification question //independently//, without access to the baseline response or other verification answers. This isolation mitigates confirmation bias.
- **Stage 4 -- Final Verified Response**: The LLM synthesizes the baseline response with all verification answers, revising and correcting factual errors to produce a refined output.
The independence constraint in Stage 3 is critical. When the model can see its original response while answering verification questions, it tends to simply confirm its earlier claims -- defeating the purpose of verification.
===== Variants =====
Three variants of CoVe have been proposed:
  * **Joint CoVe** -- Verification planning and execution happen in a single prompt alongside the baseline response. Simplest, but least effective: the verification answers can attend to the draft and tend to confirm it.
  * **2-Step CoVe** -- Verification planning and execution are split into separate prompts; the questions are generated first, then answered together in a second prompt that does not include the baseline response.
  * **Factored CoVe** -- Each verification question is answered in its own prompt, fully isolated from the baseline response and from the other verification answers. Most effective variant.
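The practical difference between the variants is how the verification step is packaged into prompts. A minimal sketch (function names and prompt wording are illustrative, not from the paper):

<code python>
def joint_verification_prompt(query, baseline, questions):
    """Joint CoVe: one prompt carries the draft AND all verification
    questions, so answers can be biased toward confirming the draft."""
    qs = "\n".join(questions)
    return (
        f"Query: {query}\n"
        f"Draft response: {baseline}\n"
        f"Answer these verification questions, then revise the draft:\n{qs}"
    )

def factored_verification_prompts(questions):
    """Factored CoVe: each question becomes its own prompt with NO access
    to the draft, enforcing the Stage 3 independence constraint."""
    return [f"Answer concisely and factually: {q}" for q in questions]

prompts = factored_verification_prompts(
    ["Was Marie Curie born in Warsaw?", "Did she win two Nobel Prizes?"]
)
# Each factored prompt omits the draft entirely:
assert all("Draft response" not in p for p in prompts)
# The joint prompt, by contrast, includes it:
assert "Draft response" in joint_verification_prompt("q", "draft", ["Q?"])
</code>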
===== Formal Description =====
Given a query $q$, the process can be formalized as:
$$R_0 = \text{LLM}(q)$$
$$V = \{v_1, v_2, \ldots, v_k\} = \text{Plan}(q, R_0)$$
$$a_i = \text{LLM}(v_i) \quad \forall i \in \{1, \ldots, k\}$$
$$R_{\text{final}} = \text{Revise}(q, R_0, \{(v_i, a_i)\})$$
The independence constraint requires that $a_i$ is generated without conditioning on $R_0$ or any $a_j$ where $j \neq i$.
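The four equations map directly onto four calls. A minimal sketch with stubbed components (all names hypothetical; the lookup-table "model" exists only to trace control flow) makes the independence constraint explicit: each $a_i$ is computed from $v_i$ alone, with neither $R_0$ nor any other $a_j$ in scope.

<code python>
def cove(query, llm, plan, revise):
    r0 = llm(query)                        # R_0 = LLM(q)
    questions = plan(query, r0)            # V = Plan(q, R_0)
    answers = [llm(v) for v in questions]  # a_i = LLM(v_i); no R_0 in scope
    return revise(query, r0, list(zip(questions, answers)))  # R_final

# Toy stubs so the pipeline runs offline:
kb = {
    "Who discovered radium?": "Marie Curie discovered radium in 1898.",
    "Was radium discovered in 1898?": "Yes, in 1898.",
}
llm = lambda prompt: kb.get(prompt, "Unknown.")
plan = lambda q, r0: ["Was radium discovered in 1898?"]
revise = lambda q, r0, qa: r0  # keep the draft when verification agrees

print(cove("Who discovered radium?", llm, plan, revise))
# Marie Curie discovered radium in 1898.
</code>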
===== Code Example =====
<code python>
import openai  # client = openai.OpenAI() -- pass an initialized client

def chain_of_verification(query, client):
    # Stage 1: Generate baseline response
    baseline = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": query}],
    ).choices[0].message.content

    # Stage 2: Plan verification questions
    plan_prompt = (
        f"Given this query: {query}\n"
        f"And this draft response: {baseline}\n"
        "List verification questions to fact-check key claims."
    )
    questions_raw = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": plan_prompt}],
    ).choices[0].message.content
    questions = [q.strip() for q in questions_raw.strip().split("\n") if q.strip()]

    # Stage 3: Independently answer each verification question
    # (each call sees only the question, never the baseline draft)
    answers = {}
    for vq in questions:
        answers[vq] = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": vq}],
        ).choices[0].message.content

    # Stage 4: Generate final verified response
    verification_context = "\n".join(
        f"Q: {q}\nA: {a}" for q, a in answers.items()
    )
    revise_prompt = (
        f"Original query: {query}\n"
        f"Draft response: {baseline}\n"
        f"Verification results:\n{verification_context}\n"
        "Revise the draft to correct any factual errors found."
    )
    return client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": revise_prompt}],
    ).choices[0].message.content
</code>
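The function above can be exercised without network access by passing a stub that mirrors the small slice of the OpenAI client interface it uses (''client.chat.completions.create(...).choices[0].message.content''). ''FakeClient'' below is a hypothetical test double, not part of the ''openai'' package; it returns the same canned reply for every stage, which is enough for a smoke test of the control flow:

<code python>
from types import SimpleNamespace

class FakeClient:
    """Test double exposing client.chat.completions.create(...)."""
    def __init__(self, reply):
        create = lambda model, messages: SimpleNamespace(
            choices=[SimpleNamespace(message=SimpleNamespace(content=reply))]
        )
        self.chat = SimpleNamespace(
            completions=SimpleNamespace(create=create)
        )

client = FakeClient("Paris is the capital of France.")
out = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Capital of France?"}],
).choices[0].message.content
print(out)  # Paris is the capital of France.
</code>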
===== Experimental Results =====
CoVe was evaluated on hallucination-prone tasks using Meta's LLaMA models:
^ Task ^ Metric ^ Baseline ^ CoVe (Factored) ^
| List-based QA (Wikidata) | Precision | Low | Significant improvement |
| Closed-book MultiSpanQA | F1 | 0.39 | 0.48 (+23%) |
| Longform Generation | FactScore | 60.8 | 63.7 |
Key findings:
* Factored CoVe consistently outperforms Joint and 2-Step variants
* Verification questions are answered more accurately than the same facts appear in baseline responses
* CoVe outperforms Chain-of-Thought, Self-CheckGPT, and CoT+ReVISE baselines
===== References =====
* [[https://arxiv.org/abs/2309.11495|Dhuliawala et al., "Chain-of-Verification Reduces Hallucination in Large Language Models", arXiv:2309.11495 (2023)]]
* [[https://aclanthology.org/2024.findings-acl.212/|ACL Findings 2024 publication]]
===== See Also =====
* [[llm_hallucination|LLM Hallucination Survey]]
* [[step_back_prompting|Step-Back Prompting]]
* [[least_to_most_prompting|Least-to-Most Prompting]]