Chain-of-Thought (CoT) is a prompting technique that encourages large language models to decompose complex reasoning tasks into intermediate steps before arriving at a final answer. Introduced by Wei et al., 2022 in “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models,” CoT prompting has demonstrated significant improvements on arithmetic, commonsense, and symbolic reasoning benchmarks1). By making the reasoning process explicit, CoT allows models to allocate more computation to problems that require multi-step logic. Modern implementations such as Nemotron 3 Nano Omni employ enable_thinking toggles and reasoning_budget parameters, allowing developers to trade compute for analytical depth on a per-request basis2).
CoT prompting augments the standard input-output prompting format by inserting intermediate reasoning steps between the question and the answer. Rather than predicting the answer directly, the model generates a sequence of logical steps that decompose the problem. For example, given a math word problem, the model first identifies the relevant quantities, sets up equations, solves step-by-step, and then states the final answer.
The key insight from Wei et al., 2022 is that this capability is emergent at scale: CoT prompting provides substantial gains only in models with roughly 100B+ parameters. Smaller models tend to produce illogical chains that do not improve accuracy.
The original paper demonstrated that PaLM 540B with just 8 hand-crafted CoT exemplars achieved 58.1% accuracy on GSM8K (grade-school math), up from 17.9% with standard prompting, surpassing even fine-tuned GPT-3 with a verifier.
Few-Shot CoT (Wei et al., 2022) provides the model with several exemplars that each include a question, a reasoning chain, and an answer. The model learns to mimic the reasoning pattern from these demonstrations. This approach achieved state-of-the-art results across arithmetic (GSM8K, SVAMP, MAWPS), commonsense (CommonsenseQA, StrategyQA), and symbolic reasoning (Last Letter Concatenation, Coin Flip) benchmarks.
The following example demonstrates few-shot CoT prompting with exemplars that teach the model to show its reasoning:
```python
# Few-shot Chain-of-Thought prompting example
from openai import OpenAI

client = OpenAI()

# Few-shot exemplars with explicit reasoning chains
few_shot_prompt = (
    "Solve each problem step by step.\n\n"
    "Q: A store has 15 apples. 8 are sold and 12 more arrive. How many apples?\n"
    "A: Start with 15 apples. Sold 8, so 15 - 8 = 7. Then 12 arrive, so 7 + 12 = 19.\n"
    "The answer is 19.\n\n"
    "Q: A train travels 60 mph for 2.5 hours. How far does it go?\n"
    "A: Distance = speed x time. Distance = 60 x 2.5 = 150 miles.\n"
    "The answer is 150 miles.\n\n"
    "Q: A baker makes 4 batches of 12 cookies, then gives away 15. How many remain?\n"
    "A:"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": few_shot_prompt}],
    temperature=0,
)
print(response.choices[0].message.content)
# Model follows the pattern: 4 x 12 = 48, then 48 - 15 = 33. The answer is 33.
```
Zero-Shot CoT (Kojima et al., 2022) discovered that simply appending “Let's think step by step” to the prompt triggers chain-of-thought reasoning without any exemplars3). This dramatically reduces the manual effort of crafting demonstrations while achieving competitive performance on many tasks. The simplicity of this approach made it one of the most widely adopted prompting techniques.
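The technique reduces to a one-line prompt transformation. A minimal sketch, where `build_zero_shot_cot` is a hypothetical helper (not from the paper) that appends the trigger phrase; the resulting prompt can be sent to any chat completion endpoint, such as the client shown in the few-shot example above:

```python
def build_zero_shot_cot(question: str) -> str:
    """Append the zero-shot CoT trigger phrase from Kojima et al., 2022."""
    return f"Q: {question}\nA: Let's think step by step."

prompt = build_zero_shot_cot(
    "A jar holds 3 red and 5 blue marbles. How many marbles are in 4 such jars?"
)
print(prompt)
```

In the original paper this is actually a two-pass procedure: the model first generates the reasoning chain, then a second prompt ("Therefore, the answer is") extracts the final answer from that chain.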
Auto-CoT (Zhang et al., 2022) automates demonstration construction by clustering questions into diverse groups and using the LLM itself, prompted with “Let's think step by step,” to generate a reasoning chain for a representative question from each cluster4). This removes the manual effort of few-shot CoT while mitigating the risk of error propagation from poorly chosen exemplars.
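The clustering stage can be sketched as follows. This is a simplified illustration, not the paper's implementation: it uses TF-IDF vectors as a stand-in for the Sentence-BERT embeddings Auto-CoT actually uses, and each selected representative would then be answered with the zero-shot trigger to produce its demonstration chain:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

questions = [
    "A store has 15 apples. 8 are sold and 12 more arrive. How many apples?",
    "A baker makes 4 batches of 12 cookies, then gives away 15. How many remain?",
    "A train travels 60 mph for 2.5 hours. How far does it go?",
    "A cyclist rides 20 km/h for 90 minutes. What distance is covered?",
]

# Stage 1: cluster questions so demonstrations cover diverse problem types.
vecs = TfidfVectorizer().fit_transform(questions)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vecs)

# Stage 2: pick the question nearest each centroid as a demonstration seed.
# Each seed is then answered via zero-shot CoT to build its reasoning chain.
demos = []
for k in range(km.n_clusters):
    members = [i for i, lbl in enumerate(km.labels_) if lbl == k]
    best = min(
        members,
        key=lambda i: np.linalg.norm(vecs[i].toarray() - km.cluster_centers_[k]),
    )
    demos.append(questions[best])

print(demos)
```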
CoT prompting has been evaluated extensively across arithmetic, commonsense, and symbolic reasoning domains.
Self-Consistency (Wang et al., 2022) further improves CoT by sampling multiple reasoning paths and taking the majority-voted answer, yielding additional gains of 5-15% across benchmarks5).
Multimodal CoT extends the technique to vision-language models through a two-stage process: first generating a textual rationale from image and text inputs, then inferring the answer by combining the rationale with the multimodal context.
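The two-stage control flow can be sketched independently of any particular model. Here `vlm_generate` is a hypothetical placeholder for a real vision-language model call (it is not an API from the Multimodal-CoT work); the sketch only shows how the rationale from stage one is fed back into stage two:

```python
def vlm_generate(prompt: str, image) -> str:
    """Hypothetical stand-in for a vision-language model call."""
    raise NotImplementedError("replace with a real VLM call")

def multimodal_cot(image, question: str, model=vlm_generate) -> str:
    # Stage 1: generate a textual rationale from image + question.
    rationale = model(f"Question: {question}\nRationale:", image)
    # Stage 2: infer the answer conditioned on question, image, and rationale.
    return model(f"Question: {question}\nRationale: {rationale}\nAnswer:", image)
```

Separating the stages lets the rationale generator be trained or prompted independently of the answer inference step.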
Modern reasoning models like OpenAI's o1 and o3 series have internalized chain-of-thought as a native capability. These models generate hidden internal reasoning chains before producing their final output, effectively performing CoT without explicit user prompting. This “internal CoT” approach has pushed performance on benchmarks like GSM8K and MATH to near-saturation levels.
Research in 2025 has revealed that LLM hidden states encode the likelihood of CoT success even before generation begins, as shown by probing classifiers that predict reasoning chain quality from pre-generation representations (ACL 2025 findings). Meanwhile, studies from Wharton's Generative AI Lab show that for already-capable reasoning models, explicit CoT prompting may yield only marginal gains with 20-80% higher latency, suggesting that the technique is most valuable for models that have not yet internalized reasoning patterns.
CoT is not a universal solution. Key limitations include: