====== Self-Consistency ======
**Self-Consistency** is a decoding strategy introduced by Wang et al. (2022) that improves chain-of-thought (CoT) prompting by sampling multiple diverse reasoning paths from a language model and selecting the most consistent final answer via majority vote.(([[https://arxiv.org/abs/2203.11171|Wang et al. "Self-Consistency Improves Chain of Thought Reasoning in Language Models." arXiv:2203.11171, 2022.]])) The method replaces greedy decoding with stochastic sampling and marginalizes over reasoning chains, achieving substantial accuracy gains across arithmetic, commonsense, and symbolic reasoning benchmarks.(([[https://arxiv.org/abs/2201.11903|Wei et al. "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." arXiv:2201.11903, 2022.]]))
<code mermaid>
graph TD
    Q[Question] --> P1[Path 1: Reasoning A]
    Q --> P2[Path 2: Reasoning B]
    Q --> P3[Path 3: Reasoning C]
    Q --> P4[Path 4: Reasoning D]
    Q --> P5[Path 5: Reasoning E]
    P1 & P2 & P3 & P4 & P5 --> MV[Majority Vote]
    MV --> ANS[Final Answer]
</code>
===== Motivation =====
Standard chain-of-thought prompting uses greedy decoding, producing a single reasoning path. This is brittle: if the model makes an error anywhere in the chain, the final answer is wrong. The key insight is that complex problems typically admit **multiple valid reasoning strategies** that all lead to the same correct answer, while incorrect answers tend to arise from isolated, low-probability error paths.
===== Method =====
Self-consistency operates in three steps:
- **Prompt with CoT exemplars** — Use standard few-shot chain-of-thought prompting (identical to regular CoT)
- **Sample diverse reasoning paths** — Instead of greedy decoding, sample $k$ independent completions using temperature-based sampling
- **Majority vote** — Extract the final answer from each path and select the answer that appears most frequently
$$a^* = \text{mode}\left(\{\text{extract}(r_i)\}_{i=1}^{k}\right)$$
where $r_i \sim p_\theta(\cdot | \text{prompt}, T)$ are sampled reasoning chains at temperature $T$.
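Concretely, answer extraction and the mode can be sketched in a few lines. The reasoning strings below are toy stand-ins for sampled model output, not actual completions:

```python
import collections

# Toy sampled reasoning chains (k = 5); in practice these come from the model.
paths = [
    "3 apples + 5 apples = 8. The answer is 8",
    "5 + 3 = 8. The answer is 8",
    "3 * 5 = 15. The answer is 15",   # an isolated error path
    "Adding 3 and 5 gives 8. The answer is 8",
    "3 + 5 = 8. The answer is 8",
]

# extract(r_i): take the text after the final "The answer is"
answers = [p.rsplit("The answer is", 1)[-1].strip() for p in paths]

# a* = mode of the extracted answers
a_star, votes = collections.Counter(answers).most_common(1)[0]
print(a_star, votes)  # -> 8 4
```

Note that the single error path is simply outvoted; it never has to be detected or repaired.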
===== Theoretical Intuition =====
The approach is grounded in the observation that diverse valid solution paths converge on correct answers while errors scatter across different wrong answers:
$$P(a^* = a_{\text{correct}}) \geq P(r_i \text{ yields } a_{\text{correct}})$$
This is analogous to ensemble methods in machine learning — the "wisdom of crowds" applied to reasoning chains from a single model. Self-consistency can also be viewed as an approximation to marginalization over the reasoning path:
$$P(a | q) = \sum_{r} P(a | r) \cdot P(r | q) \approx \frac{1}{k} \sum_{i=1}^{k} \mathbb{1}[\text{extract}(r_i) = a]$$
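The Monte Carlo approximation can be checked on a toy distribution where $P(a|q)$ is known in closed form. The distribution below is illustrative, not drawn from any benchmark:

```python
import collections
import random

random.seed(0)

# Toy "true" answer distribution P(a | q) after marginalizing over paths.
p_answer = {"8": 0.6, "15": 0.25, "11": 0.15}

# Draw k path-level answers ~ P(a | q) and form the empirical estimate.
k = 40
samples = random.choices(list(p_answer), weights=list(p_answer.values()), k=k)
empirical = {a: c / k for a, c in collections.Counter(samples).items()}

# The empirical mode recovers argmax_a P(a | q) with high probability.
a_star = max(empirical, key=empirical.get)
print(a_star)             # mode of the sampled answers
print(empirical[a_star])  # empirical estimate of P(a* | q)
```

As $k$ grows, the vote frequencies converge to the true answer marginal, which is why the mode is a consistent estimator of the highest-probability answer.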
===== Reference Implementation =====
A minimal implementation of the three steps above:
<code python>
import collections


class SelfConsistency:
    def __init__(self, model, num_samples=40, temperature=0.7):
        self.model = model
        self.num_samples = num_samples
        self.temperature = temperature

    def solve(self, question, cot_exemplars):
        prompt = self._build_cot_prompt(question, cot_exemplars)
        # Sample k diverse reasoning paths (replaces greedy decoding)
        reasoning_paths = []
        for _ in range(self.num_samples):
            path = self.model.generate(
                prompt,
                temperature=self.temperature,
                max_tokens=512,
            )
            reasoning_paths.append(path)
        # Extract the final answer from each path
        answers = [self._extract_answer(path) for path in reasoning_paths]
        # Majority vote
        vote_counts = collections.Counter(answers)
        best_answer = vote_counts.most_common(1)[0][0]
        # Confidence = proportion of votes for the winning answer
        confidence = vote_counts[best_answer] / len(answers)
        return best_answer, confidence

    def _extract_answer(self, reasoning_path):
        # Parse the final "The answer is X" from the reasoning chain
        lines = reasoning_path.strip().split("\n")
        for line in reversed(lines):
            idx = line.lower().rfind("answer is")
            if idx != -1:  # case-insensitive match
                return line[idx + len("answer is"):].strip().rstrip(".")
        return lines[-1].strip()

    def _build_cot_prompt(self, question, exemplars):
        prompt = ""
        for ex in exemplars:
            prompt += f"Q: {ex['question']}\nA: {ex['reasoning']}\n\n"
        prompt += f"Q: {question}\nA: Let's think step by step."
        return prompt
</code>
===== Relationship to Chain-of-Thought =====
Self-consistency is a **pure decoding strategy** layered on top of CoT:
^ Aspect ^ CoT (Greedy) ^ CoT + Self-Consistency ^
| Prompting | Few-shot CoT exemplars | Same few-shot CoT exemplars |
| Decoding | Greedy (top-1 token) | Temperature sampling, $k$ paths |
| Answer selection | Single path output | Majority vote across $k$ paths |
| Cost | 1x | $k$x (linear in samples) |
| Accuracy | Baseline | Significantly improved |
The key innovation is entirely in the decoding strategy — no changes to prompts, model, or training.
===== Key Results =====
Self-consistency demonstrated substantial improvements across multiple benchmarks:
^ Benchmark ^ Task Type ^ Improvement over CoT ^
| GSM8K | Arithmetic reasoning | +17.9% |
| SVAMP | Arithmetic | +11.0% |
| AQuA | Algebraic | +12.2% |
| MultiArith | Multi-step arithmetic | +24.0% |
| StrategyQA | Commonsense | +6.4% |
| ARC-challenge | Science reasoning | +3.9% |
| CommonsenseQA | Commonsense | +5.0% |
The method achieved new state-of-the-art when used with PaLM-540B and GPT-3, with particularly large gains on arithmetic tasks where diverse solution strategies are most available.
===== Cost-Accuracy Tradeoffs =====
The primary tradeoff is computational: generating $k$ samples costs $k$ times more than a single greedy decode. Empirical findings:
* **Diminishing returns**: Most gains come from the first 10-20 samples; beyond 40 samples, improvements plateau
* **Sweet spot**: 20-40 samples provides the best accuracy-per-token ratio for most tasks
  * **Free confidence estimates**: The vote distribution provides a natural uncertainty signal — high consensus across paths correlates with high accuracy
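The diminishing-returns behavior can be illustrated with a stylized model: assume each sampled path is independently correct with probability $p$, and that errors scatter across distinct wrong answers so the correct answer wins whenever correct paths hold a strict majority. These are simplifying assumptions for intuition, not the paper's experimental setup:

```python
from math import comb


def majority_accuracy(p: float, k: int) -> float:
    """P(correct majority) when each of k paths is independently
    correct with probability p and wrong answers never agree."""
    return sum(comb(k, i) * p**i * (1 - p) ** (k - i)
               for i in range(k // 2 + 1, k + 1))


# Accuracy rises quickly at small k, then flattens (odd k avoids ties).
for k in (1, 5, 11, 21, 41, 81):
    print(f"k={k:2d}  accuracy={majority_accuracy(0.6, k):.3f}")
```

Under this model, going from 1 to 5 samples buys more accuracy than going from 41 to 81, mirroring the empirical plateau reported in the paper.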
===== Practical Advantages =====
* **Unsupervised** — No additional training, fine-tuning, or human annotation required
* **Model-agnostic** — Works with any language model that supports temperature sampling
* **Composable** — Can be combined with other techniques (e.g., self-consistency over Tree-of-Thoughts paths)
* **Confidence estimation** — Vote proportions serve as natural uncertainty estimates
===== See Also =====
* [[quiet_star|Quiet-STaR]]
* [[chain_of_thought|Chain-of-Thought Reasoning]]
* [[reasoning_models|Reasoning Models]]
* [[tree_of_thoughts|Tree of Thoughts]]
* [[chain_of_draft|Chain of Draft]]
===== References =====