====== Chain-of-Verification (CoVe) ======

**Chain-of-Verification (CoVe)** is a prompting method introduced by Dhuliawala et al. at Meta in 2023 that systematically reduces hallucinations in large language models through structured self-verification. Rather than relying on external knowledge bases, CoVe leverages the LLM's own ability to fact-check its outputs by decomposing verification into independent sub-tasks.

<code>
graph TD
    A[User Query] --> B[Stage 1: Generate Draft Response]
    B --> C[Stage 2: Plan Verification Questions]
    C --> D[Stage 3: Answer Qs Independently]
    D --> E[Stage 4: Check Consistency]
    E --> F{Errors Found?}
    F -->|Yes| G[Revise Response]
    F -->|No| H[Final Verified Response]
    G --> H
</code>

===== Overview =====

LLMs frequently generate plausible but factually incorrect content -- a phenomenon known as hallucination. CoVe addresses this by introducing a four-stage pipeline that forces the model to critically examine its own draft response before producing a final answer. The key insight is that LLMs answer short, focused verification questions more accurately than they generate complex, multi-fact responses.

===== The Four-Stage Process =====

CoVe decomposes response generation into four sequential stages:

  - **Stage 1 -- Baseline Response Generation**: The LLM produces an initial draft answer to the user query. This response may contain hallucinated facts due to reliance on parametric knowledge.
  - **Stage 2 -- Verification Planning**: Given the query and baseline response, the LLM generates a set of targeted verification questions designed to probe key factual claims (e.g., "Was [person] really born in [city]?").
  - **Stage 3 -- Independent Verification Execution**: The LLM answers each verification question //independently//, without access to the baseline response or other verification answers. This isolation mitigates confirmation bias.
  - **Stage 4 -- Final Verified Response**: The LLM synthesizes the baseline response with all verification answers, revising and correcting factual errors to produce a refined output.

The independence constraint in Stage 3 is critical. When the model can see its original response while answering verification questions, it tends to simply confirm its earlier claims -- defeating the purpose of verification.

===== Variants =====

Three variants of CoVe have been proposed:

  * **Joint CoVe** -- Verification questions are planned and answered in a single prompt. Simplest, but least effective: the answers are conditioned on the draft response and tend to repeat its errors.
  * **2-Step CoVe** -- Planning and execution are split: questions are generated first, then answered together in a separate prompt that does not contain the baseline response.
  * **Factored CoVe** -- Each verification question is answered in its own prompt, fully isolated from the baseline response and from the other answers. Most effective variant.

===== Formal Description =====

Given a query $q$, the process can be formalized as:

$$R_0 = \text{LLM}(q)$$
$$V = \{v_1, v_2, \ldots, v_k\} = \text{Plan}(q, R_0)$$
$$a_i = \text{LLM}(v_i) \quad \forall i \in \{1, \ldots, k\}$$
$$R_{\text{final}} = \text{Revise}(q, R_0, \{(v_i, a_i)\})$$

The independence constraint requires that $a_i$ is generated without conditioning on $R_0$ or any $a_j$ where $j \neq i$.

===== Code Example =====

The following implements the factored variant with the OpenAI Python client; ''client'' is an ''openai.OpenAI'' instance.

<code python>
def chain_of_verification(query, client):
    # Stage 1: Generate baseline response
    baseline = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": query}],
    ).choices[0].message.content

    # Stage 2: Plan verification questions
    plan_prompt = (
        f"Given this query: {query}\n"
        f"And this draft response: {baseline}\n"
        "List verification questions to fact-check key claims, one per line."
    )
    questions_raw = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": plan_prompt}],
    ).choices[0].message.content
    questions = [q.strip() for q in questions_raw.splitlines() if q.strip()]

    # Stage 3: Answer each verification question independently --
    # each prompt contains only the question, never the baseline response
    answers = {}
    for vq in questions:
        answers[vq] = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": vq}],
        ).choices[0].message.content

    # Stage 4: Generate final verified response
    verification_context = "\n".join(
        f"Q: {q}\nA: {a}" for q, a in answers.items()
    )
    revise_prompt = (
        f"Original query: {query}\n"
        f"Draft response: {baseline}\n"
        f"Verification results:\n{verification_context}\n"
        "Revise the draft to correct any factual errors found."
    )
    return client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": revise_prompt}],
    ).choices[0].message.content
</code>

===== Experimental Results =====

CoVe was evaluated on hallucination-prone tasks using Meta's LLaMA models:

^ Task ^ Metric ^ Baseline ^ CoVe (Factored) ^
| List-based QA (Wikidata) | Precision | Low | Significant improvement |
| Closed-book MultiSpanQA | F1 | 0.39 | 0.48 (+23%) |
| Longform Generation | FactScore | 60.8 | 63.7 |

Key findings:

  * Factored CoVe consistently outperforms the Joint and 2-Step variants.
  * Verification questions are answered more accurately than the same facts are stated in baseline responses.
  * CoVe outperforms Chain-of-Thought, Self-CheckGPT, and CoT+ReVISE baselines.

===== References =====

  * [[https://arxiv.org/abs/2309.11495|Dhuliawala et al., "Chain-of-Verification Reduces Hallucination in Large Language Models", arXiv:2309.11495 (2023)]]
  * [[https://aclanthology.org/2024.findings-acl.212/|ACL Findings 2024 publication]]

===== See Also =====

  * [[llm_hallucination|LLM Hallucination Survey]]
  * [[step_back_prompting|Step-Back Prompting]]
  * [[least_to_most_prompting|Least-to-Most Prompting]]
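The four-stage flow can also be exercised offline, without API access, by substituting a stub for the model. The minimal sketch below is illustrative only: the ''fake_llm'' function, the ''cove'' helper, and the canned Marie Curie responses are hypothetical stand-ins, not part of the original method. It shows the factored variant's key property: in Stage 3 each prompt contains //only// the question, never the draft, so a focused answer ("Warsaw") can override a hallucinated draft claim ("Paris").

```python
# Offline sketch of factored CoVe with a deterministic stub LLM.
# fake_llm and its canned outputs are hypothetical stand-ins for model calls.

def fake_llm(prompt):
    # Stage 1: the draft contains a deliberate factual error (Paris)
    if prompt.startswith("Query:"):
        return "Marie Curie was born in Paris in 1867."
    # Stage 2: plan short, focused verification questions
    if prompt.startswith("Plan:"):
        return "Where was Marie Curie born?\nWhen was Marie Curie born?"
    # Stage 3: focused questions are answered more reliably than long drafts
    if prompt == "Where was Marie Curie born?":
        return "Warsaw"
    if prompt == "When was Marie Curie born?":
        return "1867"
    # Stage 4: revise the draft using the verification results
    if prompt.startswith("Revise:"):
        return "Marie Curie was born in Warsaw in 1867."
    raise ValueError(f"unexpected prompt: {prompt!r}")

def cove(query, llm):
    draft = llm(f"Query: {query}")                          # Stage 1
    questions = llm(f"Plan: {query} / {draft}").splitlines()  # Stage 2
    # Stage 3 (factored): each prompt holds ONLY the question --
    # neither the draft nor the other answers are in context
    answers = [llm(q) for q in questions]
    pairs = "\n".join(f"Q: {q} A: {a}" for q, a in zip(questions, answers))
    return llm(f"Revise: {query} / {draft} / {pairs}")      # Stage 4

final = cove("Where and when was Marie Curie born?", fake_llm)
print(final)  # the draft's "Paris" error is corrected to "Warsaw"
```

Because the verification prompts are answered in isolation, the stub's Stage 3 answers cannot be biased toward confirming the draft, mirroring the independence constraint of the factored variant.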