Chain-of-Verification (CoVe) is a prompting method introduced by Dhuliawala et al. at Meta in 2023 that systematically reduces hallucinations in large language models through structured self-verification. Rather than relying on external knowledge bases, CoVe leverages the LLM's own ability to fact-check its outputs by decomposing verification into independent sub-tasks.
LLMs frequently generate plausible but factually incorrect content – a phenomenon known as hallucination. CoVe addresses this by introducing a four-stage pipeline that forces the model to critically examine its own draft response before producing a final answer. The key insight is that LLMs answer short, focused verification questions more accurately than they generate complex, multi-fact responses.
CoVe decomposes response generation into four sequential stages:

1. **Generate baseline response** – the LLM drafts an initial answer to the query.
2. **Plan verifications** – given the query and the draft, the model generates a set of verification questions targeting the draft's factual claims.
3. **Execute verifications** – each verification question is answered independently, without access to the original draft.
4. **Generate final verified response** – the draft is revised in light of the verification answers, correcting any inconsistencies.
The independence constraint in Stage 3 is critical. When the model can see its original response while answering verification questions, it tends to simply confirm its earlier claims – defeating the purpose of verification.
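The constraint comes down to what goes into the Stage 3 prompt. A minimal sketch (helper names here are hypothetical, not from the paper) contrasting a joint-style verification prompt, which exposes the draft, with a factored-style prompt, which withholds it:

```python
def build_joint_prompt(query: str, draft: str, question: str) -> list[dict]:
    """Joint-style prompt: the draft stays in context, so the model
    can simply re-confirm its earlier claims."""
    content = (
        f"Query: {query}\n"
        f"Draft response: {draft}\n"
        f"Verification question: {question}"
    )
    return [{"role": "user", "content": content}]


def build_factored_prompt(question: str) -> list[dict]:
    """Factored-style prompt: only the verification question is sent,
    so the answer cannot condition on the draft."""
    return [{"role": "user", "content": question}]


joint = build_joint_prompt(
    "When was the Eiffel Tower built?",
    "The Eiffel Tower was completed in 1887.",
    "In what year was the Eiffel Tower completed?",
)
factored = build_factored_prompt("In what year was the Eiffel Tower completed?")
assert "1887" in joint[0]["content"]        # draft visible
assert "1887" not in factored[0]["content"]  # draft withheld
```

Keeping the draft out of the factored prompt is what prevents the verifier from anchoring on its own earlier (possibly hallucinated) claim.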
Four variants of CoVe have been proposed, differing in how verification is executed:

- **Joint** – planning and execution happen in a single prompt; the draft remains in context, so verification answers can be biased toward it.
- **2-Step** – planning and execution use separate prompts, but all verification questions are answered together.
- **Factored** – each verification question is answered in its own prompt, with no access to the draft or to other answers; this variant performs best empirically.
- **Factor+Revise** – extends factored CoVe with an explicit cross-check step that compares each verification answer against the corresponding claim in the draft before revising.
Given a query $q$, the process can be formalized as:
$$R_0 = \text{LLM}(q)$$
$$V = \{v_1, v_2, \ldots, v_k\} = \text{Plan}(q, R_0)$$
$$a_i = \text{LLM}(v_i) \quad \forall i \in \{1, \ldots, k\}$$
$$R_{\text{final}} = \text{Revise}(q, R_0, \{(v_i, a_i)\})$$
The independence constraint requires that $a_i$ is generated without conditioning on $R_0$ or any $a_j$ where $j \neq i$.
```python
import openai


def chain_of_verification(query, client):
    # Stage 1: Generate baseline response
    baseline = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": query}],
    ).choices[0].message.content

    # Stage 2: Plan verification questions
    plan_prompt = (
        f"Given this query: {query}\n"
        f"And this draft response: {baseline}\n"
        "List verification questions to fact-check key claims."
    )
    questions_raw = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": plan_prompt}],
    ).choices[0].message.content
    questions = [q.strip() for q in questions_raw.strip().split("\n") if q.strip()]

    # Stage 3: Independently answer each verification question
    # (the draft is deliberately NOT included in these prompts)
    answers = {}
    for vq in questions:
        ans = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": vq}],
        ).choices[0].message.content
        answers[vq] = ans

    # Stage 4: Generate final verified response
    verification_context = "\n".join(
        f"Q: {q}\nA: {a}" for q, a in answers.items()
    )
    revise_prompt = (
        f"Original query: {query}\n"
        f"Draft response: {baseline}\n"
        f"Verification results:\n{verification_context}\n"
        "Revise the draft to correct any factual errors found."
    )
    final = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": revise_prompt}],
    ).choices[0].message.content
    return final
```
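The same four stages can be exercised offline by abstracting the model behind a plain callable. The following is a self-contained sketch, not the paper's code: the prompts, the stub's canned replies, and the hallucinated draft are all illustrative.

```python
from typing import Callable


def cove(query: str, llm: Callable[[str], str]) -> str:
    # Stage 1: baseline draft R0
    draft = llm(query)
    # Stage 2: plan verification questions V = {v1, ..., vk}
    raw = llm(f"List verification questions for:\nQuery: {query}\nDraft: {draft}")
    questions = [q.strip() for q in raw.split("\n") if q.strip()]
    # Stage 3: answer each vi independently (R0 never enters the prompt)
    answers = [(q, llm(q)) for q in questions]
    # Stage 4: revise the draft using the (vi, ai) pairs
    qa = "\n".join(f"Q: {q}\nA: {a}" for q, a in answers)
    return llm(f"Query: {query}\nDraft: {draft}\nVerification:\n{qa}\nRevise the draft.")


def stub_llm(prompt: str) -> str:
    """Deterministic stand-in for an LLM, so the pipeline runs offline."""
    if prompt.startswith("List verification questions"):
        return "What city is Australia's capital?"      # Stage 2
    if prompt.startswith("Query:"):
        return "Canberra is the capital of Australia."  # Stage 4 revision
    if prompt == "What city is Australia's capital?":
        return "Canberra"                               # Stage 3
    return "Sydney is the capital of Australia."        # Stage 1: hallucinated draft


print(cove("What is the capital of Australia?", stub_llm))
# -> Canberra is the capital of Australia.
```

The stub's baseline deliberately hallucinates; because the Stage 3 answer is produced from the focused question alone, the revision step can override the draft with the verified fact.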
CoVe was evaluated on hallucination-prone tasks using Meta's LLaMA models:
| Task | Metric | Baseline | CoVe (Factored) |
|---|---|---|---|
| List-based QA (Wikidata) | Precision | Low | Significant improvement |
| Closed-book MultiSpanQA | F1 | 0.39 | 0.48 (+23%) |
| Longform Generation | FactScore | 60.8 | 63.7 |
Key findings: