Core Concepts
Reasoning Techniques
Memory Systems
Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools & Products
Safety & Governance
Evaluation
Research
Development
Meta
Program of Thoughts (PoT) is a prompting method introduced by Chen et al. (2023) that disentangles reasoning from computation by having LLMs generate executable Python code as their “thoughts” instead of natural language reasoning steps. The generated code is executed by an external Python interpreter to produce precise numerical answers, eliminating arithmetic errors inherent in text-based Chain-of-Thought reasoning.
Chain-of-Thought prompting requires the LLM to perform both reasoning and computation within natural language. While LLMs are strong reasoners, they are unreliable calculators – accumulating errors in multi-step arithmetic, struggling with large numbers, and failing at symbolic manipulation. PoT leverages a key insight: LLMs trained on code (e.g., Codex) can express reasoning as programs and delegate exact computation to an interpreter.
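To make the failure mode concrete, here is a small illustrative example (ours, not from the paper) of arithmetic that is trivial for an interpreter but that text-based reasoning frequently gets wrong:

```python
# Multi-digit multiplication is a classic failure case for text-based
# CoT, but trivial for Python's arbitrary-precision integers.
a = 123_456_789
b = 987_654_321
product = a * b
print(product)  # 121932631112635269 -- exact, no digit or carry errors
```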
PoT follows a generate-then-execute pipeline:

1. **Generate**: the LLM writes a Python program whose variables and comments mirror the reasoning steps.
2. **Execute**: an external (sandboxed) Python interpreter runs the program.
3. **Answer**: the program's printed output is returned as the final answer; failed executions can be retried.
The reasoning structure is preserved in the code (variable names, comments, step sequencing) while all arithmetic is handled externally.
$$\text{PoT}: q \xrightarrow{\text{LLM}} \text{code} \xrightarrow{\text{exec}} \text{answer}$$
$$\text{CoT}: q \xrightarrow{\text{LLM}} \text{NL reasoning + answer}$$
```python
import os
import subprocess
import tempfile

class ProgramOfThoughts:
    def __init__(self, llm, exemplars, max_retries=3):
        self.llm = llm
        self.exemplars = exemplars  # 8-shot code examples
        self.max_retries = max_retries

    def solve(self, question):
        prompt = self._build_prompt(question)
        for attempt in range(self.max_retries):
            # Step 1: LLM generates Python code
            code = self.llm.generate(prompt, stop=["Q:", "---"])
            # Step 2: Execute in sandboxed interpreter
            answer = self._execute_code(code)
            if answer is not None:
                return answer
        return None  # All retries failed

    def _execute_code(self, code):
        """Execute generated Python code safely and capture its output."""
        with tempfile.NamedTemporaryFile(mode='w', suffix='.py',
                                         delete=False) as f:
            f.write(code)
            path = f.name
        try:
            result = subprocess.run(
                ['python3', path],
                capture_output=True, text=True, timeout=10,
            )
            if result.returncode == 0 and result.stdout.strip():
                return result.stdout.strip()
        except subprocess.TimeoutExpired:
            pass
        finally:
            os.unlink(path)  # clean up the temp file
        return None

    def _build_prompt(self, question):
        prompt = "# Solve each question by writing Python code.\n\n"
        for ex in self.exemplars:
            prompt += f"Q: {ex['question']}\n"
            prompt += f"# Solution:\n{ex['code']}\n\n"
        prompt += f"Q: {question}\n# Solution:\n"
        return prompt


# Example: PoT solving a word problem
# Q: "A store has 45 apples. They sell 3/5 of them and receive
#     a shipment of 28 more. How many apples do they have?"
# PoT generates:
#     apples_initial = 45
#     apples_sold = apples_initial * (3/5)
#     apples_remaining = apples_initial - apples_sold
#     apples_after_shipment = apples_remaining + 28
#     print(int(apples_after_shipment))
# >>> 46
```
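As a usage sketch, the execute half of the pipeline can be run end to end on the example program above. This is a self-contained demo (ours, not the paper's code); `sys.executable` stands in for a hard-coded `python3` so it runs under whichever interpreter is active:

```python
import os
import subprocess
import sys
import tempfile

# The code PoT generates for the apple word problem above:
generated = """\
apples_initial = 45
apples_sold = apples_initial * (3/5)
apples_remaining = apples_initial - apples_sold
apples_after_shipment = apples_remaining + 28
print(int(apples_after_shipment))
"""

# Execute it the same way _execute_code does: in a separate interpreter
# process with a timeout, taking captured stdout as the answer.
with tempfile.NamedTemporaryFile('w', suffix='.py', delete=False) as f:
    f.write(generated)
    path = f.name
try:
    result = subprocess.run([sys.executable, path],
                            capture_output=True, text=True, timeout=10)
    answer = result.stdout.strip()
finally:
    os.unlink(path)

print(answer)  # 46
```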
| Aspect | Chain-of-Thought (CoT) | Program of Thoughts (PoT) |
|---|---|---|
| Reasoning medium | Natural language | Python code |
| Computation | Performed by the LLM (error-prone) | External interpreter (exact) |
| Large numbers | Frequent errors | Handled precisely |
| Symbolic ops | Limited | Full Python capabilities |
| Loops/conditionals | Implicit in text | Explicit in code |
| Token efficiency | Verbose for calculations | Compact code expressions |
| Qualitative reasoning | Strong | Weaker (code is a less natural medium for it) |
PoT and CoT are complementary: PoT excels at quantitative/symbolic tasks while CoT is better for qualitative reasoning and tasks requiring world knowledge.
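One naive way to exploit this complementarity is to route questions by surface features. The paper does not prescribe a router; the heuristic below (digit detection) and the solver stubs are purely illustrative assumptions:

```python
def route(question, cot_solver, pot_solver):
    """Naive router: send number-heavy questions to PoT, the rest to CoT.

    Illustrative only -- a real system might use a trained classifier,
    or run both solvers and reconcile their answers.
    """
    has_numbers = any(ch.isdigit() for ch in question)
    return pot_solver(question) if has_numbers else cot_solver(question)

# Toy solvers standing in for real CoT / PoT pipelines:
cot = lambda q: "cot:" + q
pot = lambda q: "pot:" + q
print(route("What is 3/5 of 45?", cot, pot))     # routed to PoT
print(route("Why do markets crash?", cot, pot))  # routed to CoT
```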
Evaluated primarily with Codex (code-davinci-002) using few-shot prompting:
| Dataset | PoT (Few-Shot) | CoT (Few-Shot) | Improvement |
|---|---|---|---|
| GSM8K | 58.0% | 56.7% | +1.3% |
| AQuA | 64.6% | 50.7% | +13.9% |
| SVAMP | 89.0% | 80.2% | +8.8% |
| FinQA (financial QA) | 55.6% | – | – |
| MultiArith (multi-step) | outperforms CoT | – | – |
The largest gains appear on tasks requiring precise multi-step arithmetic (AQuA, SVAMP) where CoT's computation errors accumulate. Ablation studies confirm that code execution boosts accuracy by 10-15% over treating the generated code as text-only reasoning.