Program-Aided Language Models (PAL)

Program-Aided Language Models (PAL) is a reasoning framework that divides the problem-solving process between an LLM and a program interpreter. The LLM handles natural language understanding and problem decomposition, while a Python interpreter handles the actual computation. This hybrid approach eliminates arithmetic and logical errors that plague purely text-based reasoning methods like Chain-of-Thought prompting.1)

How It Works

PAL operates through a clear division of labor:

  1. The LLM's role: Reads the natural language problem and generates intermediate reasoning steps as executable Python code.
  2. The interpreter's role: Executes the generated code to compute the final answer.

This means the only “learning” task for the LLM is decomposing the problem into runnable steps. The actual solving is delegated to a deterministic program runtime, eliminating computational errors.
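The two-stage pipeline above can be sketched in a few lines. This is a minimal, self-contained illustration, not the paper's implementation: `generate_code` is a hypothetical stand-in for an LLM call (hard-coded here so the sketch runs), and `answer` is an assumed variable-naming convention for the generated snippet.

```python
def generate_code(problem: str) -> str:
    # Hypothetical LLM stand-in. In a real PAL system this would call
    # an LLM with few-shot prompts that demonstrate code-style reasoning.
    return (
        "tennis_balls_initial = 5\n"
        "cans_bought = 2\n"
        "balls_per_can = 3\n"
        "answer = tennis_balls_initial + cans_bought * balls_per_can\n"
    )

def solve(problem: str):
    code = generate_code(problem)   # step 1: LLM decomposes the problem into code
    namespace = {}
    exec(code, namespace)           # step 2: interpreter performs the computation
    return namespace["answer"]

print(solve("Roger has 5 tennis balls. He buys 2 cans of 3 balls each."))  # -> 11
```

The LLM never does arithmetic here; it only has to emit a correct decomposition, and the interpreter does the rest.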

Example

Given a math word problem like:

"Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
 Each can has 3 tennis balls. How many tennis balls does he have now?"

A Chain-of-Thought approach would reason in natural language (potentially making arithmetic mistakes). PAL instead generates:

tennis_balls_initial = 5
cans_bought = 2
balls_per_can = 3
total = tennis_balls_initial + (cans_bought * balls_per_can)
print(total)  # 11

The Python interpreter executes this code deterministically: as long as the decomposition is correct, the computed answer is guaranteed to be correct.
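The execution step can be reproduced directly with the standard library. A sketch (the snippet string below is the example above; capturing stdout is one plausible way to recover the printed answer, not something the paper prescribes):

```python
import io
from contextlib import redirect_stdout

generated_code = """
tennis_balls_initial = 5
cans_bought = 2
balls_per_can = 3
total = tennis_balls_initial + (cans_bought * balls_per_can)
print(total)
"""

# Run the LLM-generated snippet and capture whatever it prints.
buffer = io.StringIO()
with redirect_stdout(buffer):
    exec(generated_code, {})

answer = buffer.getvalue().strip()
print(answer)  # -> 11
```

In practice, executing model-generated code should be sandboxed; plain `exec` is shown only for brevity.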

Comparison to Chain-of-Thought

Aspect                 | Chain-of-Thought (CoT)              | PAL
-----------------------|-------------------------------------|----------------------------------
Reasoning format       | Natural language text               | Executable Python code
Computation            | LLM performs arithmetic             | Interpreter performs arithmetic
Error sources          | Decomposition + computation errors  | Decomposition errors only
Correctness guarantee  | None (probabilistic)                | Computation is deterministic
Task scope             | Any reasoning task                  | Tasks with computable steps
Model requirement      | Any LLM                             | LLM with code-generation ability

Benchmark Results

PAL was evaluated across 13 mathematical, symbolic, and algorithmic reasoning tasks:2)

Benchmark                           | Improvement (PAL vs. CoT)
------------------------------------|------------------------------------------
GSM8K (grade school math)           | +8-15% absolute over PaLM-540B with CoT
GSM-Hard (harder variant)           | +40% absolute over CoT
BIG-Bench Hard (3 reasoning tasks)  | +11% over CoT
COLORED OBJECTS                     | +8.8% over CoT
PENGUINS                            | Significant improvement

Notably, PAL using Codex outperformed the much larger PaLM-540B model using chain-of-thought prompting, demonstrating that the hybrid approach can compensate for smaller model size.

Model Compatibility

While PAL was initially developed with code-specialized models like Codex, it generalizes to general-purpose LLMs. Using text-davinci-002 and text-davinci-003, PAL still outperformed CoT, indicating the approach works effectively with models trained primarily on natural language rather than code.3)

Limitations

PAL shifts, but does not remove, the burden on the LLM: the interpreter faithfully executes whatever code it is given, so decomposition errors still produce wrong answers. The approach also applies only to tasks whose steps can be expressed as executable code, and it requires a model with sufficient code-generation ability.
References

1) Gao et al. 2023, PAL: Program-Aided Language Models, ICML 2023
2) Gao et al. 2023, experimental results
3) Gao et al. 2023, Section 5