Program-Aided Language Models (PAL) is a reasoning framework that divides problem solving between an LLM and a program interpreter: the LLM handles natural-language understanding and problem decomposition, while a Python interpreter performs the actual computation. This hybrid approach eliminates the arithmetic and logical errors that plague purely text-based reasoning methods such as Chain-of-Thought prompting.1)
PAL operates through a clear division of labor: the LLM reads the problem and decomposes it into intermediate steps written as executable code, while the interpreter runs that code to produce the final answer.
This means the only "learning" task for the LLM is decomposing the problem into runnable steps; the actual solving is delegated to a deterministic program runtime, which eliminates computational errors.
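In practice this division of labor is elicited with a few-shot prompt whose exemplars write their reasoning as code rather than prose. A minimal sketch is below; the exemplar is a standard GSM8K-style problem, and `build_pal_prompt` is an illustrative helper, not code released with the paper.

```python
# One worked exemplar teaches the model that reasoning steps are Python
# statements; the new question is appended and the model completes the code.
PAL_PROMPT = '''\
Q: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?

# solution in Python:
money_initial = 23
bagels = 5
bagel_cost = 3
money_spent = bagels * bagel_cost
money_left = money_initial - money_spent
print(money_left)

Q: {question}

# solution in Python:
'''

def build_pal_prompt(question: str) -> str:
    """Insert a new word problem after the worked exemplar."""
    return PAL_PROMPT.format(question=question)
```

The LLM's completion of this prompt is the program to be executed; the model never performs the arithmetic itself.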
Given a math word problem like:
"Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?"
A Chain-of-Thought approach would reason in natural language (potentially making arithmetic mistakes). PAL instead generates:
tennis_balls_initial = 5
cans_bought = 2
balls_per_can = 3
total = tennis_balls_initial + (cans_bought * balls_per_can)
print(total)  # 11
The Python interpreter executes this code and returns the correct answer with certainty.
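Mechanically, the host program runs the generated code and captures whatever it prints as the answer. The sketch below shows one way to do this; `run_generated_code` is a hypothetical helper, and a real deployment should sandbox model-generated code rather than call `exec` directly.

```python
import contextlib
import io

def run_generated_code(code: str) -> str:
    """Execute model-generated code in a fresh namespace and
    return whatever it prints, stripped of trailing whitespace."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})  # NOTE: untrusted code; sandbox in production
    return buf.getvalue().strip()

# The code the LLM generated for the tennis-ball problem:
generated = """\
tennis_balls_initial = 5
cans_bought = 2
balls_per_can = 3
total = tennis_balls_initial + (cans_bought * balls_per_can)
print(total)
"""

answer = run_generated_code(generated)  # "11"
```

Because the interpreter is deterministic, the same generated program always yields the same answer; any remaining error must come from the decomposition, not the computation.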
| Aspect | Chain-of-Thought (CoT) | PAL |
|---|---|---|
| Reasoning format | Natural language text | Executable Python code |
| Computation | LLM performs arithmetic | Interpreter performs arithmetic |
| Error source | Decomposition + computation errors | Only decomposition errors |
| Correctness guarantee | None (probabilistic) | Computation is deterministic |
| Task scope | Any reasoning task | Tasks with computable steps |
| Model requirement | Any LLM | LLM with code generation ability |
PAL was evaluated across 13 mathematical, symbolic, and algorithmic reasoning tasks:2)
| Benchmark | PAL vs. CoT Improvement |
|---|---|
| GSM8K (grade school math) | +8-15% absolute over PaLM-540B with CoT |
| GSM-Hard (harder variant) | +40% absolute over CoT |
| BIG-Bench Hard (3 reasoning tasks) | +11% over CoT |
| COLORED OBJECTS | +8.8% over CoT |
| PENGUINS | Significant improvement |
Notably, PAL using Codex outperformed the much larger PaLM-540B model using chain-of-thought prompting, demonstrating that the hybrid approach can compensate for smaller model size.
While PAL was initially developed with code-specialized models like Codex, it generalizes to general-purpose LLMs. Using text-davinci-002 and text-davinci-003, PAL still outperformed CoT, indicating the approach works effectively with models trained primarily on natural language rather than code.3)