====== Self-Consistency ======

**Self-Consistency** is a decoding strategy introduced by Wang et al. (2022) that improves chain-of-thought (CoT) prompting by sampling multiple diverse reasoning paths from a language model and selecting the most consistent final answer via majority vote. The method replaces greedy decoding with stochastic sampling and marginalizes over reasoning chains, achieving substantial accuracy gains across arithmetic, commonsense, and symbolic reasoning benchmarks.

<code>
graph TD
    Q[Question] --> P1[Path 1: Reasoning A]
    Q --> P2[Path 2: Reasoning B]
    Q --> P3[Path 3: Reasoning C]
    Q --> P4[Path 4: Reasoning D]
    Q --> P5[Path 5: Reasoning E]
    P1 & P2 & P3 & P4 & P5 --> MV[Majority Vote]
    MV --> ANS[Final Answer]
</code>

===== Motivation =====

Standard chain-of-thought prompting uses greedy decoding, producing a single reasoning path. This is brittle: if the model makes an error anywhere in the chain, the final answer is wrong. The key insight is that complex problems typically admit **multiple valid reasoning strategies** that all lead to the same correct answer, while incorrect answers tend to arise from isolated, low-probability error paths.

===== Method =====

Self-consistency operates in three steps:

  - **Prompt with CoT exemplars** — Use standard few-shot chain-of-thought prompting (identical to regular CoT)
  - **Sample diverse reasoning paths** — Instead of greedy decoding, sample $k$ independent completions using temperature-based sampling
  - **Majority vote** — Extract the final answer from each path and select the answer that appears most frequently

$$a^* = \text{mode}\left(\{\text{extract}(r_i)\}_{i=1}^{k}\right)$$

where $r_i \sim p_\theta(\cdot | \text{prompt}, T)$ are sampled reasoning chains at temperature $T$.
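The majority-vote step can be illustrated with a few hypothetical sampled answers (the values here are made up for illustration):

<code python>
from collections import Counter

# Hypothetical final answers extracted from k = 5 sampled reasoning paths.
sampled_answers = ["18", "18", "17", "18", "26"]

# a* = mode({extract(r_i)}): the most frequent final answer wins.
votes = Counter(sampled_answers)
best_answer, count = votes.most_common(1)[0]
print(best_answer, count / len(sampled_answers))  # → 18 0.6
</code>

The vote proportion (here 3/5 = 0.6) doubles as a confidence estimate for the selected answer.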
===== Theoretical Intuition =====

The approach is grounded in the observation that diverse valid solution paths converge on correct answers while errors scatter across different wrong answers:

$$P(a^* = a_{\text{correct}}) \geq P(r_i \text{ yields } a_{\text{correct}})$$

This is analogous to ensemble methods in machine learning — the "wisdom of crowds" applied to reasoning chains from a single model. Self-consistency can also be viewed as an approximation to marginalization over the reasoning path:

$$P(a | q) = \sum_{r} P(a | r) \cdot P(r | q) \approx \frac{1}{k} \sum_{i=1}^{k} \mathbb{1}[\text{extract}(r_i) = a]$$

A reference implementation:

<code python>
import collections


class SelfConsistency:
    def __init__(self, model, num_samples=40, temperature=0.7):
        self.model = model
        self.num_samples = num_samples
        self.temperature = temperature

    def solve(self, question, cot_exemplars):
        prompt = self._build_cot_prompt(question, cot_exemplars)

        # Sample k diverse reasoning paths (replaces greedy decoding)
        reasoning_paths = []
        for _ in range(self.num_samples):
            path = self.model.generate(
                prompt,
                temperature=self.temperature,
                max_tokens=512,
            )
            reasoning_paths.append(path)

        # Extract the final answer from each path
        answers = [self._extract_answer(path) for path in reasoning_paths]

        # Majority vote
        vote_counts = collections.Counter(answers)
        best_answer = vote_counts.most_common(1)[0][0]

        # Confidence = proportion of votes for the winning answer
        confidence = vote_counts[best_answer] / len(answers)
        return best_answer, confidence

    def _extract_answer(self, reasoning_path):
        # Parse the final "The answer is X" from the reasoning chain,
        # case-insensitively (so "Answer is" also matches)
        lines = reasoning_path.strip().split("\n")
        for line in reversed(lines):
            lower = line.lower()
            if "answer is" in lower:
                start = lower.index("answer is") + len("answer is")
                return line[start:].strip().rstrip(".")
        return lines[-1].strip()

    def _build_cot_prompt(self, question, exemplars):
        prompt = ""
        for ex in exemplars:
            prompt += f"Q: {ex['question']}\nA: {ex['reasoning']}\n\n"
        prompt += f"Q: {question}\nA: Let's think step by step."
        return prompt
</code>

===== Relationship to Chain-of-Thought =====

Self-consistency is a **pure decoding strategy** layered on top of CoT:

^ Aspect ^ CoT (Greedy) ^ CoT + Self-Consistency ^
| Prompting | Few-shot CoT exemplars | Same few-shot CoT exemplars |
| Decoding | Greedy (top-1 token) | Temperature sampling, $k$ paths |
| Answer selection | Single path output | Majority vote across $k$ paths |
| Cost | 1x | $k$x (linear in samples) |
| Accuracy | Baseline | Significantly improved |

The key innovation is entirely in the decoding strategy — no changes to prompts, model, or training.

===== Key Results =====

Self-consistency demonstrated substantial improvements across multiple benchmarks:

^ Benchmark ^ Task Type ^ Improvement over CoT ^
| GSM8K | Arithmetic reasoning | +17.9% |
| SVAMP | Arithmetic | +11.0% |
| AQuA | Algebraic | +12.2% |
| MultiArith | Multi-step arithmetic | +24.0% |
| StrategyQA | Commonsense | +6.4% |
| ARC-challenge | Science reasoning | +3.9% |
| CommonsenseQA | Commonsense | +5.0% |

The method achieved new state-of-the-art results when used with PaLM-540B and GPT-3, with particularly large gains on arithmetic tasks where diverse solution strategies are most available.

===== Cost-Accuracy Tradeoffs =====

The primary tradeoff is computational: generating $k$ samples costs $k$ times more than a single greedy decode.
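The value of additional samples can be sketched under a simplified model: assume each path is independently correct with probability $p$ and wrong answers never agree with one another, so the vote is correct whenever a strict majority of paths is correct (a conservative lower bound on plurality voting; the independence assumption is an idealization, not a claim from the paper):

<code python>
from math import comb

def majority_vote_accuracy(p: float, k: int) -> float:
    """Probability that a strict-majority vote over k independent paths
    is correct, when each path is correct with probability p.
    Assumes wrong answers never coincide (simplifying assumption)."""
    # Strict majority: more than half of the k paths must be correct.
    return sum(comb(k, i) * p**i * (1 - p)**(k - i)
               for i in range(k // 2 + 1, k + 1))

# Odd k avoids ties; accuracy rises with k whenever p > 0.5.
for k in (1, 5, 11, 21, 41):
    print(k, round(majority_vote_accuracy(0.6, k), 3))
</code>

Even with per-path accuracy of only 0.6, a handful of samples pushes vote accuracy well above the single-path baseline, which is consistent with the diminishing-returns pattern noted below.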
Empirical findings:

  * **Diminishing returns**: Most gains come from the first 10-20 samples; beyond 40 samples, improvements plateau
  * **Sweet spot**: 20-40 samples provides the best accuracy-per-token ratio for most tasks
  * **Free confidence estimates**: The vote distribution provides a useful uncertainty signal — high consensus tends to correlate with high accuracy

===== Practical Advantages =====

  * **Unsupervised** — No additional training, fine-tuning, or human annotation required
  * **Model-agnostic** — Works with any language model that supports temperature sampling
  * **Composable** — Can be combined with other techniques (e.g., self-consistency over Tree-of-Thoughts paths)
  * **Confidence estimation** — Vote proportions serve as natural uncertainty estimates

===== References =====

  * [[https://arxiv.org/abs/2203.11171|Wang et al. "Self-Consistency Improves Chain of Thought Reasoning in Language Models" (arXiv:2203.11171)]]
  * [[https://arxiv.org/abs/2201.11903|Wei et al. "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (arXiv:2201.11903)]]

===== See Also =====

  * [[chain_of_thought]]
  * [[tree_of_thoughts]]
  * [[multi_agent_debate]]
  * [[buffer_of_thoughts]]