====== Self-Consistency ======
**Self-Consistency** is a decoding strategy introduced by Wang et al. (2022) that improves chain-of-thought (CoT) prompting by sampling multiple diverse reasoning paths from a language model and selecting the most consistent final answer via majority vote.(([[https://arxiv.org/abs/2203.11171|Wang et al. "Self-Consistency Improves Chain of Thought Reasoning in Language Models." arXiv:2203.11171, 2022.]])) The method replaces greedy decoding with stochastic sampling and marginalizes over reasoning chains, achieving substantial accuracy gains across arithmetic, commonsense, and symbolic reasoning benchmarks.(([[https://arxiv.org/abs/2201.11903|Wei et al. "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." arXiv:2201.11903, 2022.]]))
<code mermaid>
graph TD
    Q[Question] --> P1[Path 1: Reasoning A]
    Q --> P2[Path 2: Reasoning B]
    Q --> P3[Path 3: Reasoning C]
    Q --> P4[Path 4: Reasoning D]
    Q --> P5[Path 5: Reasoning E]
    P1 & P2 & P3 & P4 & P5 --> MV[Majority Vote]
    MV --> ANS[Final Answer]
</code>
===== Motivation =====
Standard chain-of-thought prompting uses greedy decoding, producing a single reasoning path. This is brittle: if the model makes an error anywhere in the chain, the final answer is wrong. The key insight is that complex problems typically admit **multiple valid reasoning strategies** that all lead to the same correct answer, while incorrect answers tend to arise from isolated, low-probability error paths.
===== Method =====
Self-consistency operates in three steps:
- **Prompt with CoT exemplars** — Use standard few-shot chain-of-thought prompting (identical to regular CoT)
- **Sample diverse reasoning paths** — Instead of greedy decoding, sample $k$ independent completions using temperature-based sampling
- **Majority vote** — Extract the final answer from each path and select the answer that appears most frequently
$$a^* = \text{mode}\left(\{\text{extract}(r_i)\}_{i=1}^{k}\right)$$
where $r_i \sim p_\theta(\cdot | \text{prompt}, T)$ are sampled reasoning chains at temperature $T$.
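Concretely, answer extraction and the mode can be sketched in a few lines. The reasoning strings below are toy stand-ins for sampled model output, not actual completions:

```python
import collections

# Toy sampled reasoning chains (k = 5); in practice these come from the model.
paths = [
    "3 apples + 5 apples = 8. The answer is 8",
    "5 + 3 = 8. The answer is 8",
    "3 * 5 = 15. The answer is 15",   # an isolated error path
    "Adding 3 and 5 gives 8. The answer is 8",
    "3 + 5 = 8. The answer is 8",
]

# extract(r_i): take the text after the final "The answer is"
answers = [p.rsplit("The answer is", 1)[-1].strip() for p in paths]

# a* = mode of the extracted answers
a_star, votes = collections.Counter(answers).most_common(1)[0]
print(a_star, votes)  # -> 8 4
```

Note that the single error path is simply outvoted; it never has to be detected or repaired.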
===== Theoretical Intuition =====
The approach is grounded in the observation that diverse valid solution paths converge on correct answers while errors scatter across different wrong answers:
$$P(a^* = a_{\text{correct}}) \geq P(r_i \text{ yields } a_{\text{correct}})$$
This is analogous to ensemble methods in machine learning — the "wisdom of crowds" applied to reasoning chains from a single model. Self-consistency can also be viewed as an approximation to marginalization over the reasoning path:
$$P(a | q) = \sum_{r} P(a | r) \cdot P(r | q) \approx \frac{1}{k} \sum_{i=1}^{k} \mathbb{1}[\text{extract}(r_i) = a]$$
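The Monte Carlo approximation can be checked on a toy distribution where $P(a|q)$ is known in closed form. The distribution below is illustrative, not drawn from any benchmark:

```python
import collections
import random

random.seed(0)

# Toy "true" answer distribution P(a | q) after marginalizing over paths.
p_answer = {"8": 0.6, "15": 0.25, "11": 0.15}

# Draw k path-level answers ~ P(a | q) and form the empirical estimate.
k = 40
samples = random.choices(list(p_answer), weights=list(p_answer.values()), k=k)
empirical = {a: c / k for a, c in collections.Counter(samples).items()}

# The empirical mode recovers argmax_a P(a | q) with high probability.
a_star = max(empirical, key=empirical.get)
print(a_star)             # mode of the sampled answers
print(empirical[a_star])  # empirical estimate of P(a* | q)
```

As $k$ grows, the vote frequencies converge to the true answer marginal, which is why the mode is a consistent estimator of the highest-probability answer.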
===== Reference Implementation =====
A minimal implementation of the three steps above:
<code python>
import collections


class SelfConsistency:
    def __init__(self, model, num_samples=40, temperature=0.7):
        self.model = model
        self.num_samples = num_samples
        self.temperature = temperature

    def solve(self, question, cot_exemplars):
        prompt = self._build_cot_prompt(question, cot_exemplars)
        # Sample k diverse reasoning paths (replaces greedy decoding)
        reasoning_paths = []
        for _ in range(self.num_samples):
            path = self.model.generate(
                prompt,
                temperature=self.temperature,
                max_tokens=512,
            )
            reasoning_paths.append(path)
        # Extract the final answer from each path
        answers = [self._extract_answer(path) for path in reasoning_paths]
        # Majority vote
        vote_counts = collections.Counter(answers)
        best_answer = vote_counts.most_common(1)[0][0]
        # Confidence = proportion of votes for the winning answer
        confidence = vote_counts[best_answer] / len(answers)
        return best_answer, confidence

    def _extract_answer(self, reasoning_path):
        # Parse the final "The answer is X" from the reasoning chain
        lines = reasoning_path.strip().split("\n")
        for line in reversed(lines):
            idx = line.lower().rfind("answer is")
            if idx != -1:  # case-insensitive match
                return line[idx + len("answer is"):].strip().rstrip(".")
        return lines[-1].strip()

    def _build_cot_prompt(self, question, exemplars):
        prompt = ""
        for ex in exemplars:
            prompt += f"Q: {ex['question']}\nA: {ex['reasoning']}\n\n"
        prompt += f"Q: {question}\nA: Let's think step by step."
        return prompt
</code>
===== Relationship to Chain-of-Thought =====
Self-consistency is a **pure decoding strategy** layered on top of CoT:
^ Aspect ^ CoT (Greedy) ^ CoT + Self-Consistency ^
| Prompting | Few-shot CoT exemplars | Same few-shot CoT exemplars |
| Decoding | Greedy (top-1 token) | Temperature sampling, $k$ paths |
| Answer selection | Single path output | Majority vote across $k$ paths |
| Cost | 1x | $k$x (linear in samples) |
| Accuracy | Baseline | Significantly improved |
The key innovation is entirely in the decoding strategy — no changes to prompts, model, or training.
===== Key Results =====
Self-consistency demonstrated substantial improvements across multiple benchmarks:
^ Benchmark ^ Task Type ^ Improvement over CoT ^
| GSM8K | Arithmetic reasoning | +17.9% |
| SVAMP | Arithmetic | +11.0% |
| AQuA | Algebraic | +12.2% |
| MultiArith | Multi-step arithmetic | +24.0% |
| StrategyQA | Commonsense | +6.4% |
| ARC-challenge | Science reasoning | +3.9% |
| CommonsenseQA | Commonsense | +5.0% |
The method achieved new state-of-the-art when used with PaLM-540B and GPT-3, with particularly large gains on arithmetic tasks where diverse solution strategies are most available.
===== Cost-Accuracy Tradeoffs =====
The primary tradeoff is computational: generating $k$ samples costs $k$ times more than a single greedy decode. Empirical findings:
* **Diminishing returns**: Most gains come from the first 10-20 samples; beyond 40 samples, improvements plateau
* **Sweet spot**: 20-40 samples provides the best accuracy-per-token ratio for most tasks
  * **Free confidence estimates**: The vote distribution provides a natural uncertainty signal — high consensus across paths correlates with high accuracy
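The diminishing-returns behavior can be illustrated with a stylized model: assume each sampled path is independently correct with probability $p$, and that errors scatter across distinct wrong answers so the correct answer wins whenever correct paths hold a strict majority. These are simplifying assumptions for intuition, not the paper's experimental setup:

```python
from math import comb


def majority_accuracy(p: float, k: int) -> float:
    """P(correct majority) when each of k paths is independently
    correct with probability p and wrong answers never agree."""
    return sum(comb(k, i) * p**i * (1 - p) ** (k - i)
               for i in range(k // 2 + 1, k + 1))


# Accuracy rises quickly at small k, then flattens (odd k avoids ties).
for k in (1, 5, 11, 21, 41, 81):
    print(f"k={k:2d}  accuracy={majority_accuracy(0.6, k):.3f}")
```

Under this model, going from 1 to 5 samples buys more accuracy than going from 41 to 81, mirroring the empirical plateau reported in the paper.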
===== Practical Advantages =====
* **Unsupervised** — No additional training, fine-tuning, or human annotation required
* **Model-agnostic** — Works with any language model that supports temperature sampling
* **Composable** — Can be combined with other techniques (e.g., self-consistency over Tree-of-Thoughts paths)
* **Confidence estimation** — Vote proportions serve as natural uncertainty estimates
===== See Also =====
* [[quiet_star|Quiet-STaR]]
* [[chain_of_thought|Chain-of-Thought Reasoning]]
* [[reasoning_models|Reasoning Models]]
* [[tree_of_thoughts|Tree of Thoughts]]
* [[chain_of_draft|Chain of Draft]]
===== References =====