Core Concepts
Reasoning Techniques
Memory Systems
Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools & Products
Safety & Governance
Evaluation
Research
Development
Meta
Program of Thoughts (PoT) is a prompting method introduced by Chen et al. (2023) that disentangles reasoning from computation by having LLMs generate executable Python code as their “thoughts” instead of natural language reasoning steps. The generated code is executed by an external Python interpreter to produce precise numerical answers, eliminating arithmetic errors inherent in text-based Chain-of-Thought reasoning.
Chain-of-Thought prompting requires the LLM to perform both reasoning and computation within natural language. While LLMs are strong reasoners, they are unreliable calculators – accumulating errors in multi-step arithmetic, struggling with large numbers, and failing at symbolic manipulation. PoT leverages a key insight: LLMs trained on code (e.g., Codex) can express reasoning as programs and delegate exact computation to an interpreter.
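To make the failure mode concrete, here is a small illustrative example (ours, not from the paper) of arithmetic that is trivial for an interpreter but that text-based reasoning frequently gets wrong:

```python
# Multi-digit multiplication is a classic failure case for text-based
# CoT, but trivial for Python's arbitrary-precision integers.
a = 123_456_789
b = 987_654_321
product = a * b
print(product)  # 121932631112635269 -- exact, no digit or carry errors
```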
PoT follows a generate-then-execute pipeline:

1. **Generate**: the LLM writes a Python program whose variables and comments mirror the reasoning steps.
2. **Execute**: an external (sandboxed) Python interpreter runs the program.
3. **Answer**: the program's printed output is returned as the final answer; failed executions can be retried.
The reasoning structure is preserved in the code (variable names, comments, step sequencing) while all arithmetic is handled externally.
$$\text{PoT}: q \xrightarrow{\text{LLM}} \text{code} \xrightarrow{\text{exec}} \text{answer}$$
$$\text{CoT}: q \xrightarrow{\text{LLM}} \text{NL reasoning + answer}$$
```python
import os
import subprocess
import tempfile

class ProgramOfThoughts:
    def __init__(self, llm, exemplars, max_retries=3):
        self.llm = llm
        self.exemplars = exemplars  # 8-shot code examples
        self.max_retries = max_retries

    def solve(self, question):
        prompt = self._build_prompt(question)
        for attempt in range(self.max_retries):
            # Step 1: LLM generates Python code
            code = self.llm.generate(prompt, stop=["Q:", "---"])
            # Step 2: Execute in sandboxed interpreter
            answer = self._execute_code(code)
            if answer is not None:
                return answer
        return None  # All retries failed

    def _execute_code(self, code):
        """Execute generated Python code safely and capture its output."""
        with tempfile.NamedTemporaryFile(mode='w', suffix='.py',
                                         delete=False) as f:
            f.write(code)
            path = f.name
        try:
            result = subprocess.run(
                ['python3', path],
                capture_output=True, text=True, timeout=10,
            )
            if result.returncode == 0 and result.stdout.strip():
                return result.stdout.strip()
        except subprocess.TimeoutExpired:
            pass
        finally:
            os.unlink(path)  # clean up the temp file
        return None

    def _build_prompt(self, question):
        prompt = "# Solve each question by writing Python code.\n\n"
        for ex in self.exemplars:
            prompt += f"Q: {ex['question']}\n"
            prompt += f"# Solution:\n{ex['code']}\n\n"
        prompt += f"Q: {question}\n# Solution:\n"
        return prompt


# Example: PoT solving a word problem
# Q: "A store has 45 apples. They sell 3/5 of them and receive
#     a shipment of 28 more. How many apples do they have?"
# PoT generates:
#     apples_initial = 45
#     apples_sold = apples_initial * (3/5)
#     apples_remaining = apples_initial - apples_sold
#     apples_after_shipment = apples_remaining + 28
#     print(int(apples_after_shipment))
# >>> 46
```
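As a usage sketch, the execute half of the pipeline can be run end to end on the example program above. This is a self-contained demo (ours, not the paper's code); `sys.executable` stands in for a hard-coded `python3` so it runs under whichever interpreter is active:

```python
import os
import subprocess
import sys
import tempfile

# The code PoT generates for the apple word problem above:
generated = """\
apples_initial = 45
apples_sold = apples_initial * (3/5)
apples_remaining = apples_initial - apples_sold
apples_after_shipment = apples_remaining + 28
print(int(apples_after_shipment))
"""

# Execute it the same way _execute_code does: in a separate interpreter
# process with a timeout, taking captured stdout as the answer.
with tempfile.NamedTemporaryFile('w', suffix='.py', delete=False) as f:
    f.write(generated)
    path = f.name
try:
    result = subprocess.run([sys.executable, path],
                            capture_output=True, text=True, timeout=10)
    answer = result.stdout.strip()
finally:
    os.unlink(path)

print(answer)  # 46
```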
| Aspect | Chain-of-Thought (CoT) | Program of Thoughts (PoT) |
|---|---|---|
| Reasoning medium | Natural language | Python code |
| Computation | Performed by the LLM (error-prone) | External interpreter (exact) |
| Large numbers | Frequent errors | Handled precisely |
| Symbolic ops | Limited | Full Python capabilities |
| Loops/conditionals | Implicit in text | Explicit in code |
| Token efficiency | Verbose for calculations | Compact code expressions |
| Qualitative reasoning | Strong | Weaker (code is a less natural medium for it) |
PoT and CoT are complementary: PoT excels at quantitative/symbolic tasks while CoT is better for qualitative reasoning and tasks requiring world knowledge.
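One naive way to exploit this complementarity is to route questions by surface features. The paper does not prescribe a router; the heuristic below (digit detection) and the solver stubs are purely illustrative assumptions:

```python
def route(question, cot_solver, pot_solver):
    """Naive router: send number-heavy questions to PoT, the rest to CoT.

    Illustrative only -- a real system might use a trained classifier,
    or run both solvers and reconcile their answers.
    """
    has_numbers = any(ch.isdigit() for ch in question)
    return pot_solver(question) if has_numbers else cot_solver(question)

# Toy solvers standing in for real CoT / PoT pipelines:
cot = lambda q: "cot:" + q
pot = lambda q: "pot:" + q
print(route("What is 3/5 of 45?", cot, pot))     # routed to PoT
print(route("Why do markets crash?", cot, pot))  # routed to CoT
```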
Evaluated primarily with Codex (code-davinci-002) using few-shot prompting:
| Dataset | PoT (Few-Shot) | CoT (Few-Shot) | Improvement |
|---|---|---|---|
| GSM8K | 58.0% | 56.7% | +1.3% |
| AQuA | 64.6% | 50.7% | +13.9% |
| SVAMP | 89.0% | 80.2% | +8.8% |
| FinQA (financial QA) | 55.6% | – | – |
| MultiArith (multi-step) | outperforms CoT | – | – |
The largest gains appear on tasks requiring precise multi-step arithmetic (AQuA, SVAMP) where CoT's computation errors accumulate. Ablation studies confirm that code execution boosts accuracy by 10-15% over treating the generated code as text-only reasoning.