Chain of Draft

Chain of Draft (CoD) is a prompting technique for large language models (LLMs) that produces concise, minimalistic intermediate reasoning steps instead of verbose explanations. Introduced in February 2025 by researchers at Zoom Communications, CoD matches or surpasses the accuracy of Chain of Thought (CoT) prompting while using as little as 7.6% of the reasoning tokens.1)

Background

Chain of Thought prompting, introduced by Wei et al. in 2022, revolutionized LLM reasoning by instructing models to “think step by step,” producing detailed intermediate reasoning chains.2) While effective at boosting accuracy on arithmetic, commonsense, and symbolic reasoning tasks, CoT generates verbose outputs that increase token usage, inference latency, and cost. The authors of Chain of Draft observed that this verbosity contrasts with how humans actually solve problems: by jotting down only the essential pieces of information needed to advance toward a solution.

How It Works

CoD modifies the standard CoT approach with a single key constraint: each intermediate reasoning step must be kept to roughly five words or fewer. The technique uses few-shot prompting with manually crafted examples that demonstrate this concise style.

The core instruction appended to the prompt is:

Think step by step, but only keep a minimum draft for each thinking step,
with 5 words at most. Return the answer at the end of the response after
a separator ####.

Rather than writing full sentences of explanation, the model outputs only the critical calculation or transformation at each step, similar to how a human might scribble shorthand notes on a scratch pad.
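The few-shot setup described above can be sketched as a simple prompt builder. This is an illustrative sketch, not code from the paper: the helper name `build_cod_prompt` and the Q:/A: formatting are assumptions; only the instruction text and the lollipop exemplar come from the article.

```python
# Sketch of assembling a Chain of Draft prompt.
# build_cod_prompt and the Q:/A: layout are hypothetical; the instruction
# string and the few-shot example are taken from the article above.

COD_INSTRUCTION = (
    "Think step by step, but only keep a minimum draft for each thinking step, "
    "with 5 words at most. Return the answer at the end of the response after "
    "a separator ####."
)

# Manually crafted few-shot examples demonstrating the concise drafting style.
FEW_SHOT = [
    {
        "question": (
            "Jason had 20 lollipops. He gave Denny some lollipops. "
            "Now Jason has 12. How many lollipops did Jason give to Denny?"
        ),
        "answer": "20 - 12 = 8\n#### 8",
    },
]

def build_cod_prompt(question: str) -> str:
    """Combine the CoD instruction, few-shot examples, and the new question."""
    parts = [COD_INSTRUCTION, ""]
    for ex in FEW_SHOT:
        parts.append(f"Q: {ex['question']}")
        parts.append(f"A: {ex['answer']}")
        parts.append("")
    parts.append(f"Q: {question}")
    parts.append("A:")
    return "\n".join(parts)

prompt = build_cod_prompt(
    "A store sold 45 apples in the morning and 30 in the afternoon. "
    "How many apples were sold in total?"
)
print(prompt)
```

The resulting string would be sent to the model as-is; the few-shot exemplar, not just the instruction, is what anchors the model to the terse drafting style.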

Example Comparison

Consider the problem: “Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12. How many lollipops did Jason give to Denny?”

Chain of Thought output:

Jason started with 20 lollipops. He gave some to Denny, and now he has 12.
To find how many he gave away, we subtract: 20 - 12 = 8.
So Jason gave Denny 8 lollipops.
#### 8

Chain of Draft output:

20 - 12 = 8
#### 8

The CoD response conveys the same reasoning path in a fraction of the tokens.
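Because both output formats place the final answer after the "####" separator, downstream code can extract it the same way for CoT and CoD. A minimal sketch (the function name `extract_answer` is an assumption, not from the paper):

```python
def extract_answer(response: str) -> str:
    """Return the final answer following the '####' separator.

    The draft (or full reasoning) precedes the separator; if no separator
    is present, fall back to the whole stripped response.
    """
    sep = "####"
    if sep in response:
        # rsplit guards against a stray '####' appearing inside the reasoning.
        return response.rsplit(sep, 1)[1].strip()
    return response.strip()

cod_response = "20 - 12 = 8\n#### 8"
print(extract_answer(cod_response))  # -> 8
```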

Benchmarks and Results

The paper evaluated CoD against standard (direct answer) prompting and CoT prompting on four reasoning benchmarks using GPT-4o and Claude 3.5 Sonnet.3)

Arithmetic Reasoning: GSM8k

Model             | Standard | CoT   | CoD   | CoT Tokens | CoD Tokens
GPT-4o            | 53.3%    | 95.4% | 91.1% | 205.1      | 43.9
Claude 3.5 Sonnet | 64.6%    | 95.8% | 91.4% | 190.0      | 39.8

CoD reduced token usage by approximately 79% while maintaining accuracy within 4-5 percentage points of CoT.

Commonsense Reasoning: Date Understanding

Model             | Standard | CoT   | CoD   | CoT Tokens | CoD Tokens
GPT-4o            | 72.6%    | 90.2% | 88.1% | 75.7       | 30.2
Claude 3.5 Sonnet | 84.3%    | 87.0% | 89.7% | 172.5      | 31.3

On Date Understanding, Claude 3.5 Sonnet with CoD actually surpassed CoT accuracy (89.7% vs. 87.0%) while using only 18.2% of the tokens.

Commonsense Reasoning: Sports Understanding

Model             | Standard | CoT   | CoD   | CoT Tokens | CoD Tokens
GPT-4o            | 90.0%    | 95.9% | 98.3% | 28.7       | 15.0
Claude 3.5 Sonnet | 90.6%    | 93.2% | 97.3% | 189.4      | 14.3

Sports Understanding produced the most dramatic results: CoD outperformed CoT on both models while Claude used only 7.6% of CoT's tokens – the headline figure cited in the paper.

Symbolic Reasoning: Coin Flip

Model             | Standard | CoT   | CoD   | CoT Tokens | CoD Tokens
GPT-4o            | 73.2%    | 100.0% | 100.0% | 52.4     | 16.8
Claude 3.5 Sonnet | 85.2%    | 100.0% | 100.0% | 135.3    | 18.9

Both methods achieved perfect accuracy on the Coin Flip task, but CoD used 68-86% fewer tokens.
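The token ratios quoted throughout this section can be recomputed directly from the per-benchmark averages in the tables above, as a quick sanity check (figures for Claude 3.5 Sonnet):

```python
# Recompute CoD-vs-CoT token ratios from the average token counts
# reported in the benchmark tables (Claude 3.5 Sonnet column).
results = {
    "GSM8k":     (190.0, 39.8),
    "Date":      (172.5, 31.3),
    "Sports":    (189.4, 14.3),
    "Coin Flip": (135.3, 18.9),
}

for name, (cot_tokens, cod_tokens) in results.items():
    ratio = cod_tokens / cot_tokens
    print(f"{name}: CoD uses {ratio:.1%} of CoT tokens "
          f"({1 - ratio:.1%} reduction)")
```

The Sports Understanding ratio works out to 7.6%, matching the headline figure, and GSM8k to about 21% (a roughly 79% reduction).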

Limitations

The paper identifies two main limitations: CoD is substantially less effective in zero-shot settings, where no few-shot exemplars demonstrate the drafting style, and its accuracy drops markedly on small models (roughly 3B parameters or fewer), which the authors attribute to the scarcity of CoD-style reasoning patterns in training data.

Relationship to Other Techniques

CoD occupies a specific niche in the landscape of prompting strategies: it sits between direct-answer prompting (minimal tokens, lower accuracy) and Chain of Thought (high accuracy, verbose output), retaining CoT's explicit intermediate steps while compressing each one to a terse draft.

Practical Recommendations

See Also

References

1)
Xu S, Xie W, Zhao L, He P. “Chain of Draft: Thinking Faster by Writing Less.” arXiv:2502.18600, February 2025. arxiv.org
2)
Wei J et al. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” NeurIPS 2022. arxiv.org
3)
Xu S et al. “Chain of Draft: Thinking Faster by Writing Less.” arxiv.org