Chain-of-Thought Reasoning

Chain-of-Thought (CoT) is a prompting technique that encourages large language models to decompose complex reasoning tasks into intermediate steps before arriving at a final answer. Introduced by Wei et al., 2022 in “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models,” CoT prompting has demonstrated significant improvements on arithmetic, commonsense, and symbolic reasoning benchmarks. By making the reasoning process explicit, CoT allows models to allocate more computation to problems that require multi-step logic. Modern implementations such as Nemotron 3 Nano Omni expose enable_thinking toggles and reasoning_budget parameters, allowing developers to trade compute for analytical depth on a per-request basis.
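
The exact controls vary by provider; the snippet below is only a hypothetical sketch. It assumes an OpenAI-compatible endpoint that accepts enable_thinking and reasoning_budget as extra request fields, and the model name is a placeholder — none of this is a documented API.

Hypothetical per-request reasoning toggle (illustrative assumptions only)
from openai import OpenAI

# Assumed: a local OpenAI-compatible server exposing reasoning controls
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="nemotron-nano",  # placeholder model name
    messages=[{"role": "user", "content": "How many primes are below 50?"}],
    extra_body={
        "enable_thinking": True,   # assumed field: emit an internal reasoning trace
        "reasoning_budget": 1024,  # assumed field: cap reasoning tokens to bound latency
    },
)
print(response.choices[0].message.content)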

How Chain-of-Thought Works

CoT prompting augments the standard input-output prompting format by inserting intermediate reasoning steps between the question and the answer. Rather than predicting the answer directly, the model generates a sequence of logical steps that decompose the problem. For example, given a math word problem, the model first identifies the relevant quantities, sets up equations, solves step-by-step, and then states the final answer.

The key insight from Wei et al., 2022 is that this capability is emergent at scale: CoT prompting provides substantial gains only in models with roughly 100B+ parameters. Smaller models tend to produce illogical chains that do not improve accuracy.

The original paper demonstrated that PaLM 540B with just 8 hand-crafted CoT exemplars achieved 58.1% accuracy on GSM8K (grade-school math), up from 17.9% with standard prompting, surpassing even fine-tuned GPT-3 with a verifier.

Few-Shot vs Zero-Shot CoT

Few-Shot CoT (Wei et al., 2022) provides the model with several exemplars that each include a question, a reasoning chain, and an answer. The model learns to mimic the reasoning pattern from these demonstrations. This approach achieved state-of-the-art results across arithmetic (GSM8K, SVAMP, MAWPS), commonsense (CommonsenseQA, StrategyQA), and symbolic reasoning (Last Letter Concatenation, Coin Flip) benchmarks.

The following example demonstrates few-shot CoT prompting with exemplars that teach the model to show its reasoning:

Few-shot Chain-of-Thought prompting example
from openai import OpenAI

client = OpenAI()

# Few-shot exemplars with explicit reasoning chains
few_shot_prompt = (
    "Solve each problem step by step.\n\n"
    "Q: A store has 15 apples. 8 are sold and 12 more arrive. How many apples?\n"
    "A: Start with 15 apples. Sold 8, so 15 - 8 = 7. Then 12 arrive, so 7 + 12 = 19.\n"
    "The answer is 19.\n\n"
    "Q: A train travels 60 mph for 2.5 hours. How far does it go?\n"
    "A: Distance = speed x time. Distance = 60 x 2.5 = 150 miles.\n"
    "The answer is 150 miles.\n\n"
    "Q: A baker makes 4 batches of 12 cookies, then gives away 15. How many remain?\n"
    "A:"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": few_shot_prompt}],
    temperature=0,
)
print(response.choices[0].message.content)
# Model follows the pattern: 4 x 12 = 48, then 48 - 15 = 33. The answer is 33.

Zero-Shot CoT (Kojima et al., 2022) showed that simply appending “Let's think step by step” to the prompt triggers chain-of-thought reasoning without any exemplars. This dramatically reduces the manual effort of crafting demonstrations while achieving competitive performance on many tasks. The simplicity of this approach made it one of the most widely adopted prompting techniques.
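
In practice this is a one-line prompt suffix. The sketch below appends the trigger phrase to a question from Kojima et al.'s paper, using the same client setup as the few-shot example above:

Zero-shot Chain-of-Thought prompting example
from openai import OpenAI

client = OpenAI()

question = (
    "A juggler has 16 balls. Half of the balls are golf balls, and half "
    "of the golf balls are blue. How many blue golf balls are there?"
)

# Kojima et al.'s trigger phrase elicits reasoning without any exemplars
prompt = f"Q: {question}\nA: Let's think step by step."

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(response.choices[0].message.content)
# Expected: "Half of 16 is 8 golf balls; half of 8 is 4. The answer is 4."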

Auto-CoT (Zhang et al., 2022) automates demonstration construction by clustering the question pool and selecting a diverse representative from each cluster, then using the LLM itself with “Let's think step by step” to generate the reasoning chains. This avoids the manual effort of few-shot CoT while mitigating the risk of error propagation from poorly constructed exemplars. A simplified sketch of the pipeline follows.
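
The sketch below is a simplified rendering of that pipeline, not Zhang et al.'s released code: it embeds candidate questions, clusters them with k-means for diversity, picks the first question in each cluster as a representative, and has the model write its own chain via zero-shot CoT.

Simplified Auto-CoT demonstration construction (illustrative sketch)
import numpy as np
from openai import OpenAI
from sklearn.cluster import KMeans

client = OpenAI()

def build_demonstrations(questions, k=4):
    # 1. Embed every candidate question
    vecs = np.array([
        client.embeddings.create(model="text-embedding-3-small", input=q)
              .data[0].embedding
        for q in questions
    ])
    # 2. Cluster for diversity; take one representative per cluster
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(vecs)
    demos = []
    for c in range(k):
        rep = next(q for q, lab in zip(questions, labels) if lab == c)
        # 3. Zero-shot CoT generates the reasoning chain for the exemplar
        chain = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user",
                       "content": f"Q: {rep}\nA: Let's think step by step."}],
            temperature=0,
        ).choices[0].message.content
        demos.append(f"Q: {rep}\nA: {chain}")
    return "\n\n".join(demos)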

Applications and Benchmarks

CoT prompting has been evaluated extensively across reasoning domains:

  • GSM8K (grade-school math): PaLM 540B improved from 17.9% to 58.1% with CoT
  • MATH (competition-level math): Significant gains with CoT, especially combined with self-consistency decoding
  • Big-Bench Hard (BBH): CoT enables strong performance on symbolic and multi-step tasks
  • MMLU (multi-task knowledge): CoT boosts reasoning-heavy subsets
  • HumanEval (code generation): Structured CoT aids program synthesis

Self-Consistency (Wang et al., 2022) further improves CoT by sampling multiple reasoning paths and taking the majority-voted answer, yielding additional gains of 5-15% across benchmarks.
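
A minimal sketch of self-consistency on top of the few-shot prompt above: sample several chains at nonzero temperature, parse each final answer, and majority-vote. The regex-based answer extraction is a simplification of the paper's parsing.

Self-consistency decoding sketch (majority vote over sampled chains)
import re
from collections import Counter
from openai import OpenAI

client = OpenAI()

def self_consistent_answer(prompt, n=10):
    # Sample n diverse reasoning paths; temperature > 0 is essential here
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
        n=n,
    )
    answers = []
    for choice in response.choices:
        # Simplified parse: grab the number after "The answer is"
        found = re.findall(r"The answer is ([\d\.]+)", choice.message.content)
        if found:
            answers.append(found[-1])
    # Majority vote across the sampled chains
    return Counter(answers).most_common(1)[0][0] if answers else None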

Multimodal CoT extends the technique to vision-language models through a two-stage process: first generating a textual rationale from image and text inputs, then inferring the answer by combining the rationale with the multimodal context.
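
A two-stage sketch of that structure using a vision-capable chat model is shown below. The staging mirrors the rationale-then-answer process described above, though the original Multimodal-CoT work fine-tuned a smaller encoder-decoder model rather than prompting an API; the image URL and question are placeholders.

Two-stage multimodal CoT sketch (rationale generation, then answer inference)
from openai import OpenAI

client = OpenAI()
image = {"type": "image_url",
         "image_url": {"url": "https://example.com/diagram.png"}}  # placeholder
question = "Which force acts on the block along the ramp?"

# Stage 1: generate a textual rationale from image + question
rationale = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": [
        image,
        {"type": "text",
         "text": f"{question}\nDescribe the relevant evidence in the image "
                 "step by step."},
    ]}],
).choices[0].message.content

# Stage 2: infer the answer conditioned on rationale + multimodal context
answer = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": [
        image,
        {"type": "text",
         "text": f"{question}\nRationale: {rationale}\nGive the final answer."},
    ]}],
).choices[0].message.content
print(answer)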

Modern Developments (2024-2025)

Modern reasoning models like OpenAI's o1 and o3 series have internalized chain-of-thought as a native capability. These models generate hidden internal reasoning chains before producing their final output, effectively performing CoT without explicit user prompting. This “internal CoT” approach has pushed performance on benchmarks like GSM8K and MATH to near-saturation levels.
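
With these models, reasoning depth is controlled through API parameters rather than prompt text. A minimal sketch, assuming OpenAI's reasoning_effort parameter for o-series models:

Reasoning-model request without an explicit CoT prompt
from openai import OpenAI

client = OpenAI()

# o-series models reason internally; no "step by step" instruction is needed
response = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="high",  # trade latency and cost for deeper internal reasoning
    messages=[{"role": "user",
               "content": "A bat and a ball cost $1.10 in total. The bat costs "
                          "$1.00 more than the ball. How much is the ball?"}],
)
print(response.choices[0].message.content)  # expected: 5 cents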

Research in 2025 has revealed that LLM hidden states encode the likelihood of CoT success even before generation begins, as shown by probing classifiers that predict reasoning chain quality from pre-generation representations (ACL 2025 findings). Meanwhile, studies from Wharton's Generative AI Lab show that for already-capable reasoning models, explicit CoT prompting may yield only marginal gains with 20-80% higher latency, suggesting that the technique is most valuable for models that have not yet internalized reasoning patterns.

Limitations and Failure Modes

CoT is not a universal solution. Key limitations include:

  • Scale dependence: Models under ~100B parameters often produce fluent but illogical chains that do not improve accuracy
  • Faithfulness concerns: The generated chain may not reflect the model's actual computation process
  • Error propagation: Mistakes in early reasoning steps compound through the chain
  • Computational cost: Generating reasoning tokens increases latency and cost
  • Task specificity: CoT provides minimal benefit on tasks that do not require multi-step reasoning
