Core Concepts
Reasoning Techniques
Memory Systems
Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools & Products
Code & Software
Safety & Security
Evaluation
Research
Development
Meta
Self-Consistency is a decoding strategy introduced by Wang et al. (2022) that improves chain-of-thought (CoT) prompting by sampling multiple diverse reasoning paths from a language model and selecting the most consistent final answer via majority vote. The method replaces greedy decoding with stochastic sampling and marginalizes over reasoning chains, achieving substantial accuracy gains across arithmetic, commonsense, and symbolic reasoning benchmarks.
Standard chain-of-thought prompting uses greedy decoding, producing a single reasoning path. This is brittle: if the model makes an error anywhere in the chain, the final answer is wrong. The key insight is that complex problems typically admit multiple valid reasoning strategies that all lead to the same correct answer, while incorrect answers tend to arise from isolated, low-probability error paths.
Self-consistency operates in three steps:

1. Prompt the model with few-shot chain-of-thought exemplars, exactly as in standard CoT prompting.
2. Sample $k$ diverse reasoning paths with temperature sampling instead of a single greedy decode.
3. Extract the final answer from each path and select the most frequent one by majority vote:
$$a^* = \text{mode}\left(\{\text{extract}(r_i)\}_{i=1}^{k}\right)$$
where $r_i \sim p_\theta(\cdot | \text{prompt}, T)$ are sampled reasoning chains at temperature $T$.
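As a concrete sketch of the aggregation step (the `majority_answer` helper and the toy extraction lambda are illustrative, not from the paper), the mode over extracted answers reduces to a simple counter:

```python
from collections import Counter

def majority_answer(reasoning_paths, extract):
    """Return the most frequent final answer across sampled reasoning chains."""
    answers = [extract(r) for r in reasoning_paths]
    return Counter(answers).most_common(1)[0][0]

# Toy example: three of four sampled chains agree on "18"
paths = ["... The answer is 18", "... The answer is 18",
         "... The answer is 20", "... The answer is 18"]
extract = lambda r: r.rsplit("answer is", 1)[-1].strip()
print(majority_answer(paths, extract))  # → 18
```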
The approach is grounded in the observation that diverse valid solution paths converge on correct answers while errors scatter across different wrong answers:
$$P(a^* = a_{\text{correct}}) \geq P(r_i \text{ yields } a_{\text{correct}})$$
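To see why the inequality holds, a quick binomial calculation (the numbers below are illustrative, not results from the paper) shows how voting amplifies per-path accuracy when each path is independently correct with some probability:

```python
from math import comb

def strict_majority_prob(p, k):
    """P(more than half of k independent paths are correct): a lower
    bound on vote accuracy, since a plurality also suffices to win
    when wrong answers scatter across distinct values."""
    return sum(comb(k, i) * p**i * (1 - p)**(k - i)
               for i in range(k // 2 + 1, k + 1))

# With per-path accuracy 0.6, the voted answer at k = 41 samples is
# correct with probability well above 0.6
print(strict_majority_prob(0.6, 41))
```

This is only a lower bound under an independence assumption; in practice sampled chains share the same model and prompt, so the gains are smaller but still substantial.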
This is analogous to ensemble methods in machine learning — the “wisdom of crowds” applied to reasoning chains from a single model. Self-consistency can also be viewed as an approximation to marginalization over the reasoning path:
$$P(a | q) = \sum_{r} P(a | r) \cdot P(r | q) \approx \frac{1}{k} \sum_{i=1}^{k} \mathbb{1}[\text{extract}(r_i) = a]$$
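The right-hand side of the approximation is just an empirical frequency over sampled answers. A minimal sketch (the `answer_distribution` helper and the sample values are hypothetical) of this Monte Carlo estimate:

```python
from collections import Counter

def answer_distribution(sampled_answers):
    """Monte Carlo estimate of P(a|q): the empirical frequency of each
    final answer across k sampled reasoning chains."""
    k = len(sampled_answers)
    return {a: c / k for a, c in Counter(sampled_answers).items()}

# 10 sampled chains: the estimate concentrates on the consistent answer
samples = ["18"] * 7 + ["20"] * 2 + ["16"]
print(answer_distribution(samples))  # → {'18': 0.7, '20': 0.2, '16': 0.1}
```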
```python
import collections


class SelfConsistency:
    def __init__(self, model, num_samples=40, temperature=0.7):
        self.model = model
        self.num_samples = num_samples
        self.temperature = temperature

    def solve(self, question, cot_exemplars):
        prompt = self._build_cot_prompt(question, cot_exemplars)

        # Sample k diverse reasoning paths (replaces greedy decoding)
        reasoning_paths = []
        for _ in range(self.num_samples):
            path = self.model.generate(
                prompt,
                temperature=self.temperature,
                max_tokens=512,
            )
            reasoning_paths.append(path)

        # Extract final answer from each path
        answers = [self._extract_answer(path) for path in reasoning_paths]

        # Majority vote
        vote_counts = collections.Counter(answers)
        best_answer = vote_counts.most_common(1)[0][0]

        # Confidence = proportion of votes for winning answer
        confidence = vote_counts[best_answer] / len(answers)
        return best_answer, confidence

    def _extract_answer(self, reasoning_path):
        # Parse the final "The answer is X" from the reasoning chain
        lines = reasoning_path.strip().split("\n")
        for line in reversed(lines):
            lowered = line.lower()
            if "answer is" in lowered:
                # Split case-insensitively after the last "answer is" marker
                idx = lowered.rindex("answer is") + len("answer is")
                return line[idx:].strip().rstrip(".")
        return lines[-1].strip()

    def _build_cot_prompt(self, question, exemplars):
        prompt = ""
        for ex in exemplars:
            prompt += f"Q: {ex['question']}\nA: {ex['reasoning']}\n\n"
        prompt += f"Q: {question}\nA: Let's think step by step."
        return prompt
```
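A self-contained end-to-end sketch of the same sample-extract-vote loop (the `MockModel` and its canned reasoning chains are hypothetical stand-ins for a real LLM client; a deterministic cycle replaces temperature sampling so the demo is reproducible):

```python
import collections
import itertools

class MockModel:
    """Hypothetical stand-in for an LLM client: cycles through canned
    reasoning chains instead of sampling from a model."""
    def __init__(self, outputs):
        self._it = itertools.cycle(outputs)
    def generate(self, prompt, temperature, max_tokens):
        return next(self._it)

model = MockModel([
    "3 packs of 2 pens is 3 * 2 = 6 pens. The answer is 6.",
    "2 pens per pack, 3 packs: 2 + 2 + 2 = 6. The answer is 6.",
    "3 + 2 = 5 pens. The answer is 5.",  # isolated error path
])

# Sample k paths, extract each final answer, then majority vote
k = 20
paths = [model.generate("Q: ...", temperature=0.7, max_tokens=512)
         for _ in range(k)]
answers = [p.rsplit("answer is", 1)[-1].strip().rstrip(".") for p in paths]
votes = collections.Counter(answers)
best, count = votes.most_common(1)[0]
print(best, count / k)  # → 6 0.7
```

The isolated error path loses the vote even though it appears repeatedly, mirroring the intuition that errors fail to concentrate on any single wrong answer.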
Self-consistency is a pure decoding strategy layered on top of CoT:
| Aspect | CoT (Greedy) | CoT + Self-Consistency |
|---|---|---|
| Prompting | Few-shot CoT exemplars | Same few-shot CoT exemplars |
| Decoding | Greedy (top-1 token) | Temperature sampling, $k$ paths |
| Answer selection | Single path output | Majority vote across $k$ paths |
| Cost | 1x | $k$x (linear in samples) |
| Accuracy | Baseline | Significantly improved |
The key innovation is entirely in the decoding strategy — no changes to prompts, model, or training.
Self-consistency demonstrated substantial improvements across multiple benchmarks:
| Benchmark | Task Type | Improvement over CoT |
|---|---|---|
| GSM8K | Arithmetic reasoning | +17.9% |
| SVAMP | Arithmetic | +11.0% |
| AQuA | Algebraic | +12.2% |
| MultiArith | Multi-step arithmetic | +24.0% |
| StrategyQA | Commonsense | +6.4% |
| ARC-challenge | Science reasoning | +3.9% |
| CommonsenseQA | Commonsense | +5.0% |
The method achieved new state-of-the-art results when used with PaLM-540B and GPT-3, with particularly large gains on arithmetic tasks, where diverse solution strategies are most available.
The primary tradeoff is computational: generating $k$ samples costs $k$ times more than a single greedy decode. Empirical findings: