Core Concepts
Reasoning Techniques
Memory Systems
Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools & Products
Code & Software
Safety & Security
Evaluation
Research
Development
Meta
Self-Consistency is a decoding strategy introduced by Wang et al. (2022) that improves chain-of-thought (CoT) prompting by sampling multiple diverse reasoning paths from a language model and selecting the most consistent final answer via majority vote. The method replaces greedy decoding with stochastic sampling and marginalizes over reasoning chains, achieving substantial accuracy gains across arithmetic, commonsense, and symbolic reasoning benchmarks.
Standard chain-of-thought prompting uses greedy decoding, producing a single reasoning path. This is brittle: if the model makes an error anywhere in the chain, the final answer is wrong. The key insight is that complex problems typically admit multiple valid reasoning strategies that all lead to the same correct answer, while incorrect answers tend to arise from isolated, low-probability error paths.
Self-consistency operates in three steps:

1. Prompt the model with few-shot chain-of-thought exemplars, exactly as in standard CoT prompting.
2. Sample $k$ diverse reasoning paths with temperature sampling instead of a single greedy decode.
3. Extract the final answer from each path and select the most frequent one by majority vote:
$$a^* = \text{mode}\left(\{\text{extract}(r_i)\}_{i=1}^{k}\right)$$
where $r_i \sim p_\theta(\cdot | \text{prompt}, T)$ are sampled reasoning chains at temperature $T$.
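As a concrete sketch of the aggregation step (the `majority_answer` helper and the toy extraction lambda are illustrative, not from the paper), the mode over extracted answers reduces to a simple counter:

```python
from collections import Counter

def majority_answer(reasoning_paths, extract):
    """Return the most frequent final answer across sampled reasoning chains."""
    answers = [extract(r) for r in reasoning_paths]
    return Counter(answers).most_common(1)[0][0]

# Toy example: three of four sampled chains agree on "18"
paths = ["... The answer is 18", "... The answer is 18",
         "... The answer is 20", "... The answer is 18"]
extract = lambda r: r.rsplit("answer is", 1)[-1].strip()
print(majority_answer(paths, extract))  # → 18
```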
The approach is grounded in the observation that diverse valid solution paths converge on correct answers while errors scatter across different wrong answers:
$$P(a^* = a_{\text{correct}}) \geq P(r_i \text{ yields } a_{\text{correct}})$$
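To see why the inequality holds, a quick binomial calculation (the numbers below are illustrative, not results from the paper) shows how voting amplifies per-path accuracy when each path is independently correct with some probability:

```python
from math import comb

def strict_majority_prob(p, k):
    """P(more than half of k independent paths are correct): a lower
    bound on vote accuracy, since a plurality also suffices to win
    when wrong answers scatter across distinct values."""
    return sum(comb(k, i) * p**i * (1 - p)**(k - i)
               for i in range(k // 2 + 1, k + 1))

# With per-path accuracy 0.6, the voted answer at k = 41 samples is
# correct with probability well above 0.6
print(strict_majority_prob(0.6, 41))
```

This is only a lower bound under an independence assumption; in practice sampled chains share the same model and prompt, so the gains are smaller but still substantial.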
This is analogous to ensemble methods in machine learning — the “wisdom of crowds” applied to reasoning chains from a single model. Self-consistency can also be viewed as an approximation to marginalization over the reasoning path:
$$P(a | q) = \sum_{r} P(a | r) \cdot P(r | q) \approx \frac{1}{k} \sum_{i=1}^{k} \mathbb{1}[\text{extract}(r_i) = a]$$
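The right-hand side of the approximation is just an empirical frequency over sampled answers. A minimal sketch (the `answer_distribution` helper and the sample values are hypothetical) of this Monte Carlo estimate:

```python
from collections import Counter

def answer_distribution(sampled_answers):
    """Monte Carlo estimate of P(a|q): the empirical frequency of each
    final answer across k sampled reasoning chains."""
    k = len(sampled_answers)
    return {a: c / k for a, c in Counter(sampled_answers).items()}

# 10 sampled chains: the estimate concentrates on the consistent answer
samples = ["18"] * 7 + ["20"] * 2 + ["16"]
print(answer_distribution(samples))  # → {'18': 0.7, '20': 0.2, '16': 0.1}
```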
```python
import collections


class SelfConsistency:
    def __init__(self, model, num_samples=40, temperature=0.7):
        self.model = model
        self.num_samples = num_samples
        self.temperature = temperature

    def solve(self, question, cot_exemplars):
        prompt = self._build_cot_prompt(question, cot_exemplars)

        # Sample k diverse reasoning paths (replaces greedy decoding)
        reasoning_paths = []
        for _ in range(self.num_samples):
            path = self.model.generate(
                prompt,
                temperature=self.temperature,
                max_tokens=512,
            )
            reasoning_paths.append(path)

        # Extract final answer from each path
        answers = [self._extract_answer(path) for path in reasoning_paths]

        # Majority vote
        vote_counts = collections.Counter(answers)
        best_answer = vote_counts.most_common(1)[0][0]

        # Confidence = proportion of votes for winning answer
        confidence = vote_counts[best_answer] / len(answers)
        return best_answer, confidence

    def _extract_answer(self, reasoning_path):
        # Parse the final "The answer is X" from the reasoning chain
        lines = reasoning_path.strip().split("\n")
        for line in reversed(lines):
            lowered = line.lower()
            if "answer is" in lowered:
                # Split case-insensitively after the last "answer is" marker
                idx = lowered.rindex("answer is") + len("answer is")
                return line[idx:].strip().rstrip(".")
        return lines[-1].strip()

    def _build_cot_prompt(self, question, exemplars):
        prompt = ""
        for ex in exemplars:
            prompt += f"Q: {ex['question']}\nA: {ex['reasoning']}\n\n"
        prompt += f"Q: {question}\nA: Let's think step by step."
        return prompt
```
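A self-contained end-to-end sketch of the same sample-extract-vote loop (the `MockModel` and its canned reasoning chains are hypothetical stand-ins for a real LLM client; a deterministic cycle replaces temperature sampling so the demo is reproducible):

```python
import collections
import itertools

class MockModel:
    """Hypothetical stand-in for an LLM client: cycles through canned
    reasoning chains instead of sampling from a model."""
    def __init__(self, outputs):
        self._it = itertools.cycle(outputs)
    def generate(self, prompt, temperature, max_tokens):
        return next(self._it)

model = MockModel([
    "3 packs of 2 pens is 3 * 2 = 6 pens. The answer is 6.",
    "2 pens per pack, 3 packs: 2 + 2 + 2 = 6. The answer is 6.",
    "3 + 2 = 5 pens. The answer is 5.",  # isolated error path
])

# Sample k paths, extract each final answer, then majority vote
k = 20
paths = [model.generate("Q: ...", temperature=0.7, max_tokens=512)
         for _ in range(k)]
answers = [p.rsplit("answer is", 1)[-1].strip().rstrip(".") for p in paths]
votes = collections.Counter(answers)
best, count = votes.most_common(1)[0]
print(best, count / k)  # → 6 0.7
```

The isolated error path loses the vote even though it appears repeatedly, mirroring the intuition that errors fail to concentrate on any single wrong answer.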
Self-consistency is a pure decoding strategy layered on top of CoT:
| Aspect | CoT (Greedy) | CoT + Self-Consistency |
|---|---|---|
| Prompting | Few-shot CoT exemplars | Same few-shot CoT exemplars |
| Decoding | Greedy (top-1 token) | Temperature sampling, $k$ paths |
| Answer selection | Single path output | Majority vote across $k$ paths |
| Cost | 1x | $k$x (linear in samples) |
| Accuracy | Baseline | Significantly improved |
The key innovation is entirely in the decoding strategy — no changes to prompts, model, or training.
Self-consistency demonstrated substantial improvements across multiple benchmarks:
| Benchmark | Task Type | Improvement over CoT |
|---|---|---|
| GSM8K | Arithmetic reasoning | +17.9% |
| SVAMP | Arithmetic | +11.0% |
| AQuA | Algebraic | +12.2% |
| MultiArith | Multi-step arithmetic | +24.0% |
| StrategyQA | Commonsense | +6.4% |
| ARC-challenge | Science reasoning | +3.9% |
| CommonsenseQA | Commonsense | +5.0% |
The method achieved new state-of-the-art results when used with PaLM-540B and GPT-3, with particularly large gains on arithmetic tasks, where diverse solution strategies are most available.
The primary tradeoff is computational: generating $k$ samples costs $k$ times more than a single greedy decode. Empirical findings: