Program-Aided Language Models (PAL)

Program-Aided Language Models (PAL) is a reasoning framework that divides the problem-solving process between an LLM and a program interpreter. The LLM handles natural language understanding and problem decomposition, while a Python interpreter handles the actual computation. This hybrid approach eliminates arithmetic and logical errors that plague purely text-based reasoning methods like Chain-of-Thought prompting.1)

How It Works

PAL operates through a clear division of labor:

  1. The LLM's role: Reads the natural language problem and generates intermediate reasoning steps as executable Python code.
  2. The interpreter's role: Executes the generated code to compute the final answer.

This means the only “learning” task for the LLM is decomposing the problem into runnable steps. The actual solving is delegated to a deterministic program runtime, eliminating computational errors.
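The two-stage pipeline above can be sketched in a few lines. This is a minimal, self-contained illustration, not the paper's implementation: `generate_code` is a hypothetical stand-in for an LLM call (hard-coded here so the sketch runs), and `answer` is an assumed variable-naming convention for the generated snippet.

```python
def generate_code(problem: str) -> str:
    # Hypothetical LLM stand-in. In a real PAL system this would call
    # an LLM with few-shot prompts that demonstrate code-style reasoning.
    return (
        "tennis_balls_initial = 5\n"
        "cans_bought = 2\n"
        "balls_per_can = 3\n"
        "answer = tennis_balls_initial + cans_bought * balls_per_can\n"
    )

def solve(problem: str):
    code = generate_code(problem)   # step 1: LLM decomposes the problem into code
    namespace = {}
    exec(code, namespace)           # step 2: interpreter performs the computation
    return namespace["answer"]

print(solve("Roger has 5 tennis balls. He buys 2 cans of 3 balls each."))  # -> 11
```

The LLM never does arithmetic here; it only has to emit a correct decomposition, and the interpreter does the rest.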

Example

Given a math word problem like:

"Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
 Each can has 3 tennis balls. How many tennis balls does he have now?"

A Chain-of-Thought approach would reason in natural language (potentially making arithmetic mistakes). PAL instead generates:

tennis_balls_initial = 5
cans_bought = 2
balls_per_can = 3
total = tennis_balls_initial + (cans_bought * balls_per_can)
print(total)  # 11

The Python interpreter executes this code deterministically: as long as the decomposition is correct, the computed answer is guaranteed to be correct.
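The execution step can be reproduced directly with the standard library. A sketch (the snippet string below is the example above; capturing stdout is one plausible way to recover the printed answer, not something the paper prescribes):

```python
import io
from contextlib import redirect_stdout

generated_code = """
tennis_balls_initial = 5
cans_bought = 2
balls_per_can = 3
total = tennis_balls_initial + (cans_bought * balls_per_can)
print(total)
"""

# Run the LLM-generated snippet and capture whatever it prints.
buffer = io.StringIO()
with redirect_stdout(buffer):
    exec(generated_code, {})

answer = buffer.getvalue().strip()
print(answer)  # -> 11
```

In practice, executing model-generated code should be sandboxed; plain `exec` is shown only for brevity.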

Comparison to Chain-of-Thought

Aspect                 | Chain-of-Thought (CoT)              | PAL
-----------------------|-------------------------------------|----------------------------------
Reasoning format       | Natural language text               | Executable Python code
Computation            | LLM performs arithmetic             | Interpreter performs arithmetic
Error sources          | Decomposition + computation errors  | Decomposition errors only
Correctness guarantee  | None (probabilistic)                | Computation is deterministic
Task scope             | Any reasoning task                  | Tasks with computable steps
Model requirement      | Any LLM                             | LLM with code-generation ability

Benchmark Results

PAL was evaluated across 13 mathematical, symbolic, and algorithmic reasoning tasks:2)

Benchmark                           | Improvement (PAL vs. CoT)
------------------------------------|------------------------------------------
GSM8K (grade school math)           | +8-15% absolute over PaLM-540B with CoT
GSM-Hard (harder variant)           | +40% absolute over CoT
BIG-Bench Hard (3 reasoning tasks)  | +11% over CoT
COLORED OBJECTS                     | +8.8% over CoT
PENGUINS                            | Significant improvement

Notably, PAL using Codex outperformed the much larger PaLM-540B model using chain-of-thought prompting, demonstrating that the hybrid approach can compensate for smaller model size.

Model Compatibility

While PAL was initially developed with code-specialized models like Codex, it generalizes to general-purpose LLMs. Using text-davinci-002 and text-davinci-003, PAL still outperformed CoT, indicating the approach works effectively with models trained primarily on natural language rather than code.3)

Limitations

PAL shifts, but does not remove, the burden on the LLM: the interpreter faithfully executes whatever code it is given, so decomposition errors still produce wrong answers. The approach also applies only to tasks whose steps can be expressed as executable code, and it requires a model with sufficient code-generation ability.
References

1) Gao et al. 2023, PAL: Program-Aided Language Models, ICML 2023
2) Gao et al. 2023, experimental results
3) Gao et al. 2023, Section 5