====== Chain-of-Verification (CoVe) ======
**Chain-of-Verification (CoVe)** is a prompting method introduced by Dhuliawala et al. at Meta in 2023 that systematically reduces hallucinations in large language models through structured self-verification. Rather than relying on external knowledge bases, CoVe leverages the LLM's own ability to fact-check its outputs by decomposing verification into independent sub-tasks.
<code mermaid>
graph TD
    A[User Query] --> B[Stage 1: Generate Draft Response]
    B --> C[Stage 2: Plan Verification Questions]
    C --> D[Stage 3: Answer Questions Independently]
    D --> E[Stage 4: Check Consistency]
    E --> F{Errors Found?}
    F -->|Yes| G[Revise Response]
    F -->|No| H[Final Verified Response]
    G --> H
</code>
===== Overview =====
LLMs frequently generate plausible but factually incorrect content -- a phenomenon known as hallucination. CoVe addresses this by introducing a four-stage pipeline that forces the model to critically examine its own draft response before producing a final answer. The key insight is that LLMs answer short, focused verification questions more accurately than they generate complex, multi-fact responses.
===== The Four-Stage Process =====
CoVe decomposes response generation into four sequential stages:
- **Stage 1 -- Baseline Response Generation**: The LLM produces an initial draft answer to the user query. This response may contain hallucinated facts due to reliance on parametric knowledge.
- **Stage 2 -- Verification Planning**: Given the query and baseline response, the LLM generates a set of targeted verification questions designed to probe key factual claims (e.g., "Was [person] really born in [city]?").
- **Stage 3 -- Independent Verification Execution**: The LLM answers each verification question //independently//, without access to the baseline response or other verification answers. This isolation mitigates confirmation bias.
- **Stage 4 -- Final Verified Response**: The LLM synthesizes the baseline response with all verification answers, revising and correcting factual errors to produce a refined output.
The independence constraint in Stage 3 is critical. When the model can see its original response while answering verification questions, it tends to simply confirm its earlier claims -- defeating the purpose of verification.
===== Variants =====
Three variants of CoVe have been proposed:
  * **Joint CoVe** -- Verification planning and execution happen in a single prompt alongside the baseline response. Simplest, but least effective: the verification answers can attend to the draft and tend to confirm it.
  * **2-Step CoVe** -- Verification planning and execution are split into separate prompts; the questions are generated first, then answered together in a second prompt that does not include the baseline response.
  * **Factored CoVe** -- Each verification question is answered in its own prompt, fully isolated from the baseline response and from the other verification answers. Most effective variant.
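The practical difference between the variants is how the verification step is packaged into prompts. A minimal sketch (function names and prompt wording are illustrative, not from the paper):

<code python>
def joint_verification_prompt(query, baseline, questions):
    """Joint CoVe: one prompt carries the draft AND all verification
    questions, so answers can be biased toward confirming the draft."""
    qs = "\n".join(questions)
    return (
        f"Query: {query}\n"
        f"Draft response: {baseline}\n"
        f"Answer these verification questions, then revise the draft:\n{qs}"
    )

def factored_verification_prompts(questions):
    """Factored CoVe: each question becomes its own prompt with NO access
    to the draft, enforcing the Stage 3 independence constraint."""
    return [f"Answer concisely and factually: {q}" for q in questions]

prompts = factored_verification_prompts(
    ["Was Marie Curie born in Warsaw?", "Did she win two Nobel Prizes?"]
)
# Each factored prompt omits the draft entirely:
assert all("Draft response" not in p for p in prompts)
# The joint prompt, by contrast, includes it:
assert "Draft response" in joint_verification_prompt("q", "draft", ["Q?"])
</code>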
===== Formal Description =====
Given a query $q$, the process can be formalized as:
$$R_0 = \text{LLM}(q)$$
$$V = \{v_1, v_2, \ldots, v_k\} = \text{Plan}(q, R_0)$$
$$a_i = \text{LLM}(v_i) \quad \forall i \in \{1, \ldots, k\}$$
$$R_{\text{final}} = \text{Revise}(q, R_0, \{(v_i, a_i)\})$$
The independence constraint requires that $a_i$ is generated without conditioning on $R_0$ or any $a_j$ where $j \neq i$.
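The four equations map directly onto four calls. A minimal sketch with stubbed components (all names hypothetical; the lookup-table "model" exists only to trace control flow) makes the independence constraint explicit: each $a_i$ is computed from $v_i$ alone, with neither $R_0$ nor any other $a_j$ in scope.

<code python>
def cove(query, llm, plan, revise):
    r0 = llm(query)                        # R_0 = LLM(q)
    questions = plan(query, r0)            # V = Plan(q, R_0)
    answers = [llm(v) for v in questions]  # a_i = LLM(v_i); no R_0 in scope
    return revise(query, r0, list(zip(questions, answers)))  # R_final

# Toy stubs so the pipeline runs offline:
kb = {
    "Who discovered radium?": "Marie Curie discovered radium in 1898.",
    "Was radium discovered in 1898?": "Yes, in 1898.",
}
llm = lambda prompt: kb.get(prompt, "Unknown.")
plan = lambda q, r0: ["Was radium discovered in 1898?"]
revise = lambda q, r0, qa: r0  # keep the draft when verification agrees

print(cove("Who discovered radium?", llm, plan, revise))
# Marie Curie discovered radium in 1898.
</code>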
===== Code Example =====
<code python>
import openai  # client = openai.OpenAI() -- pass an initialized client

def chain_of_verification(query, client):
    # Stage 1: Generate baseline response
    baseline = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": query}],
    ).choices[0].message.content

    # Stage 2: Plan verification questions
    plan_prompt = (
        f"Given this query: {query}\n"
        f"And this draft response: {baseline}\n"
        "List verification questions to fact-check key claims."
    )
    questions_raw = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": plan_prompt}],
    ).choices[0].message.content
    questions = [q.strip() for q in questions_raw.strip().split("\n") if q.strip()]

    # Stage 3: Independently answer each verification question
    # (each call sees only the question, never the baseline draft)
    answers = {}
    for vq in questions:
        answers[vq] = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": vq}],
        ).choices[0].message.content

    # Stage 4: Generate final verified response
    verification_context = "\n".join(
        f"Q: {q}\nA: {a}" for q, a in answers.items()
    )
    revise_prompt = (
        f"Original query: {query}\n"
        f"Draft response: {baseline}\n"
        f"Verification results:\n{verification_context}\n"
        "Revise the draft to correct any factual errors found."
    )
    return client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": revise_prompt}],
    ).choices[0].message.content
</code>
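The function above can be exercised without network access by passing a stub that mirrors the small slice of the OpenAI client interface it uses (''client.chat.completions.create(...).choices[0].message.content''). ''FakeClient'' below is a hypothetical test double, not part of the ''openai'' package; it returns the same canned reply for every stage, which is enough for a smoke test of the control flow:

<code python>
from types import SimpleNamespace

class FakeClient:
    """Test double exposing client.chat.completions.create(...)."""
    def __init__(self, reply):
        create = lambda model, messages: SimpleNamespace(
            choices=[SimpleNamespace(message=SimpleNamespace(content=reply))]
        )
        self.chat = SimpleNamespace(
            completions=SimpleNamespace(create=create)
        )

client = FakeClient("Paris is the capital of France.")
out = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Capital of France?"}],
).choices[0].message.content
print(out)  # Paris is the capital of France.
</code>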
===== Experimental Results =====
CoVe was evaluated on hallucination-prone tasks using Meta's LLaMA models:
^ Task ^ Metric ^ Baseline ^ CoVe (Factored) ^
| List-based QA (Wikidata) | Precision | Low | Significant improvement |
| Closed-book MultiSpanQA | F1 | 0.39 | 0.48 (+23%) |
| Longform Generation | FactScore | 60.8 | 63.7 |
Key findings:
* Factored CoVe consistently outperforms Joint and 2-Step variants
* Verification questions are answered more accurately than the same facts appear in baseline responses
* CoVe outperforms Chain-of-Thought, Self-CheckGPT, and CoT+ReVISE baselines
===== References =====
* [[https://arxiv.org/abs/2309.11495|Dhuliawala et al., "Chain-of-Verification Reduces Hallucination in Large Language Models", arXiv:2309.11495 (2023)]]
* [[https://aclanthology.org/2024.findings-acl.212/|ACL Findings 2024 publication]]
===== See Also =====
* [[llm_hallucination|LLM Hallucination Survey]]
* [[step_back_prompting|Step-Back Prompting]]
* [[least_to_most_prompting|Least-to-Most Prompting]]