Chain-of-Verification (CoVe) is a prompting method introduced by Dhuliawala et al. at Meta in 2023 that systematically reduces hallucinations in large language models through structured self-verification. Rather than relying on external knowledge bases, CoVe leverages the LLM's own ability to fact-check its outputs by decomposing verification into independent sub-tasks.
LLMs frequently generate plausible but factually incorrect content, a phenomenon known as hallucination. CoVe addresses this by introducing a four-stage pipeline that forces the model to critically examine its own draft response before producing a final answer. The key insight is that LLMs answer short, focused verification questions more accurately than they generate complex, multi-fact responses.
CoVe decomposes response generation into four sequential stages:

1. **Generate baseline response**: the model drafts an initial answer to the query.
2. **Plan verifications**: given the query and the draft, the model generates a list of verification questions that fact-check the draft's key claims.
3. **Execute verifications**: each verification question is answered independently, without access to the original draft.
4. **Generate final verified response**: the model revises the draft using the verification question-answer pairs, correcting any inconsistencies.
The independence constraint in Stage 3 is critical. When the model can see its original response while answering verification questions, it tends to simply confirm its earlier claims, defeating the purpose of verification.
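The contrast can be made concrete with a small sketch. The prompt templates, function names, and the example claim below are illustrative, not taken from the paper:

```python
# Illustrative only: shows why the draft must be kept out of Stage 3 prompts.

def joint_verification_prompt(draft: str, question: str) -> str:
    """Leaky prompt: the draft is visible, so the model tends to confirm it."""
    return f"Draft response: {draft}\nVerify: {question}"

def factored_verification_prompt(question: str) -> str:
    """Independent prompt: the question is asked with no access to the draft."""
    return f"Answer concisely and factually: {question}"

draft = "The Eiffel Tower was completed in 1887."   # (actually 1889)
question = "In what year was the Eiffel Tower completed?"

leaky = joint_verification_prompt(draft, question)
independent = factored_verification_prompt(question)

assert draft in leaky             # the wrong claim can bias the answer
assert draft not in independent   # the claim cannot leak into the answer
```

Because the factored prompt contains only the question, the model must answer from its parametric knowledge rather than echo the draft.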
Three variants of CoVe have been proposed:

1. **Joint**: verification questions are planned and answered within a single prompt, so the answers can still condition on the draft.
2. **2-Step**: planning and answering happen in separate prompts, keeping the draft out of the answering context.
3. **Factored**: each verification question is answered in its own prompt, independent of the draft and of every other answer.
Given a query $q$, the process can be formalized as:
$$R_0 = \text{LLM}(q)$$
$$V = \{v_1, v_2, \ldots, v_k\} = \text{Plan}(q, R_0)$$
$$a_i = \text{LLM}(v_i) \quad \forall i \in \{1, \ldots, k\}$$
$$R_{\text{final}} = \text{Revise}(q, R_0, \{(v_i, a_i)\})$$
The independence constraint requires that $a_i$ is generated without conditioning on $R_0$ or any $a_j$ where $j \neq i$.
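The four equations can be traced in a toy sketch; every function body below is a placeholder stub, not the paper's actual prompt templates:

```python
# Stub pipeline mirroring the formalization above (illustrative only).

def llm(prompt: str) -> str:
    # Stand-in for LLM(.): returns a canned answer string.
    return f"answer({prompt})"

def plan(q: str, r0: str) -> list[str]:
    # Stand-in for Plan(q, R_0): emits k verification questions.
    return [f"verify claim {i} of '{q}'" for i in (1, 2, 3)]

def revise(q: str, r0: str, qa: list[tuple[str, str]]) -> str:
    # Stand-in for Revise(q, R_0, {(v_i, a_i)}).
    return f"{r0} revised with {len(qa)} verifications"

q = "example query"
r0 = llm(q)                                   # R_0 = LLM(q)
vs = plan(q, r0)                              # V = {v_1, ..., v_k}
ans = [llm(v) for v in vs]                    # a_i = LLM(v_i): no R_0 in scope
r_final = revise(q, r0, list(zip(vs, ans)))   # R_final = Revise(q, R_0, {(v_i, a_i)})

print(r_final)  # → answer(example query) revised with 3 verifications
```

Note that the list comprehension computing `ans` touches only `vs`, which is how the independence constraint shows up structurally: `r0` is simply never passed in.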
```python
from openai import OpenAI  # `client` below is an OpenAI() instance


def chain_of_verification(query, client):
    # Stage 1: Generate baseline response
    baseline = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": query}],
    ).choices[0].message.content

    # Stage 2: Plan verification questions
    plan_prompt = (
        f"Given this query: {query}\n"
        f"And this draft response: {baseline}\n"
        "List verification questions to fact-check key claims."
    )
    questions_raw = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": plan_prompt}],
    ).choices[0].message.content
    questions = [q.strip() for q in questions_raw.strip().split("\n") if q.strip()]

    # Stage 3: Independently answer each verification question
    # (each question is sent alone, so the draft cannot bias the answer)
    answers = {}
    for vq in questions:
        ans = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": vq}],
        ).choices[0].message.content
        answers[vq] = ans

    # Stage 4: Generate final verified response
    verification_context = "\n".join(
        f"Q: {q}\nA: {a}" for q, a in answers.items()
    )
    revise_prompt = (
        f"Original query: {query}\n"
        f"Draft response: {baseline}\n"
        f"Verification results:\n{verification_context}\n"
        "Revise the draft to correct any factual errors found."
    )
    final = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": revise_prompt}],
    ).choices[0].message.content
    return final
```
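One practical wrinkle: models usually return the Stage 2 question list with `1.` or `-` prefixes, which the naive `split("\n")` keeps in each question. A small helper can strip them (`parse_questions` is a hypothetical name, not part of any library):

```python
import re

def parse_questions(raw: str) -> list[str]:
    """Strip leading enumeration markers so each line is a bare question."""
    questions = []
    for line in raw.splitlines():
        # Remove "1.", "2)", "-", "*", or "•" prefixes, then trim whitespace.
        line = re.sub(r"^\s*(?:[-*•]|\d+[.)])\s*", "", line).strip()
        if line:
            questions.append(line)
    return questions

raw = "1. When was X founded?\n2) Who invented Y?\n- Is Z a prime?"
print(parse_questions(raw))
# → ['When was X founded?', 'Who invented Y?', 'Is Z a prime?']
```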
CoVe was evaluated on hallucination-prone tasks using Meta's LLaMA models:
| Task | Metric | Baseline | CoVe (Factored) |
|---|---|---|---|
| List-based QA (Wikidata) | Precision | Low | Significant improvement |
| Closed-book MultiSpanQA | F1 | 0.39 | 0.48 (+23%) |
| Longform Generation | FactScore | 60.8 | 63.7 |
Key findings:

- The factored variant, which answers each verification question in a separate prompt, outperformed the joint variant, supporting the independence constraint.
- Models answered the short verification questions more accurately than they stated the same facts inside longform responses, confirming the method's core premise.
- Improvements held across all three task types: list-based QA, closed-book QA, and longform generation.