Automatic Reasoning and Tool-Use (ART)

Automatic Reasoning and Tool-use (ART) is a framework that enables frozen large language models to automatically generate multi-step reasoning programs and integrate external tool outputs without requiring fine-tuning or hand-crafted task-specific demonstrations. ART automates both chain-of-thought decomposition and tool selection using a task library approach.1)

Motivation

Traditional chain-of-thought prompting and tool-use approaches rely on carefully hand-crafted, task-specific demonstrations and manually scripted interleaving between model generations and tool calls.2) This manual effort limits scalability and requires expertise for each new task. ART automates this entire process while keeping the underlying LLM frozen.

How It Works

ART operates in two primary phases:

Selection Phase

Given a new task, ART selects relevant demonstrations of multi-step reasoning and tool use from a task library – a structured repository containing examples of how to decompose and solve various task types with appropriate tools.
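A minimal sketch of this selection step, assuming a crude word-overlap similarity measure (the library entries, scoring function, and names below are illustrative assumptions, not the paper's implementation):

```python
def similarity(task_desc, demo_desc):
    """Crude word-overlap (Jaccard) similarity between two task descriptions."""
    a, b = set(task_desc.lower().split()), set(demo_desc.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def select_demonstrations(new_task, task_library, k=2):
    """Pick the k library entries whose descriptions best match the new task."""
    ranked = sorted(task_library,
                    key=lambda d: similarity(new_task, d["description"]),
                    reverse=True)
    return ranked[:k]

# Toy task library: each entry describes a task type and holds a demonstration.
task_library = [
    {"description": "solve arithmetic word problems with a calculator tool", "demo": "..."},
    {"description": "answer factual questions using a search tool", "demo": "..."},
    {"description": "transform strings by running python code", "demo": "..."},
]

picked = select_demonstrations("multi-step arithmetic word problems", task_library)
```

In practice a learned or embedding-based similarity would replace the word-overlap heuristic, but the shape of the step is the same: rank the library against the new task and keep the top matches.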

Execution Phase

At test time, ART dynamically generates reasoning steps as a program. When external tools are needed, the generation:

  1. Pauses at the tool invocation point.
  2. Executes the selected tool with appropriate inputs.
  3. Integrates the tool output into the reasoning chain.
  4. Resumes generation with the augmented context.

This seamless interleaving of reasoning and tool use happens automatically without manual scripting.
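The pause-execute-resume loop above can be sketched in Python. The tool-call marker syntax, `run_program`, and the toy `calc` tool are illustrative assumptions, not the paper's actual program format:

```python
import re

# Toy tool library: a calculator that evaluates arithmetic expressions.
TOOLS = {"calc": lambda expr: str(eval(expr, {"__builtins__": {}}))}

# Marker the (stubbed) LLM emits when it wants a tool, e.g. [calc(3 * 7)].
TOOL_CALL = re.compile(r"\[(\w+)\((.*?)\)\]")

def run_program(prompt, generate, tools, max_steps=8):
    """Interleave frozen-LLM generation with tool execution.

    `generate(context)` returns the next chunk of the reasoning program;
    generation pauses at each tool call, the tool runs, and its output is
    spliced into the context before generation resumes.
    """
    context = prompt
    for _ in range(max_steps):
        chunk = generate(context)
        match = TOOL_CALL.search(chunk)
        if match is None:                 # no tool needed: program is complete
            return context + chunk
        name, arg = match.group(1), match.group(2)
        output = tools[name](arg)         # pause and execute the selected tool
        # keep the text up to the call, append the tool output, then resume
        context += chunk[:match.end()] + f" -> {output}"
    return context

# Scripted stand-in for a frozen LLM, emitting one chunk per call.
script = iter(["I multiply the numbers: [calc(3 * 7)]", "\nSo the answer is 21."])
result = run_program("Q: What is 3 * 7?\n", lambda ctx: next(script), TOOLS)
```

The four numbered steps map directly onto the loop body: the regex match is the pause point, the `tools[name](arg)` call is the execution, the string splice integrates the output, and the next loop iteration resumes generation on the augmented context.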

Task Library and Tool Library

ART's extensibility depends on two key components:3)

  1. Task library – demonstrations of multi-step reasoning programs, grouped by task type (e.g. arithmetic, search, string manipulation).
  2. Tool library – external tools, such as code execution and search, that the model can invoke during generation.

The framework uses a selection mechanism to identify the most appropriate demonstrations and tools for each new task, enabling generalization to unseen tasks without hand-crafted, task-specific demonstrations.

Frozen LLM Approach

A distinguishing feature of ART is its use of frozen LLMs – the underlying language model requires no fine-tuning or parameter updates. This offers several advantages:

  1. No training cost – new tasks and tools are supported purely through prompting.
  2. Model portability – the task and tool libraries can be reused with newer or different LLMs.
  3. Easy maintenance – performance improves by extending the libraries rather than retraining the model.

Human-in-the-Loop

ART is designed to support human feedback. Humans can improve performance by:

  1. Correcting errors in generated reasoning programs and adding the corrected examples to the task library.
  2. Registering new tools in the tool library to extend ART's capabilities.

This makes ART an extensible system that improves with minimal human intervention.
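As an illustration, extending the libraries might look like the following sketch; the data structures and helper names are hypothetical, not ART's actual API:

```python
# Hypothetical library structures; ART's real formats differ in detail.
task_library = {
    "arithmetic": ["Q: 12 * 7? Plan: [calc(12 * 7)] -> 84. A: 84"],
}
tool_library = {"calc": lambda expr: str(eval(expr, {"__builtins__": {}}))}

def add_demonstration(library, task_type, demo):
    """Human feedback: file a corrected demonstration under a task type."""
    library.setdefault(task_type, []).append(demo)

def register_tool(tools, name, fn):
    """Human feedback: make a new tool available to future reasoning programs."""
    tools[name] = fn

# A human fixes a failure case and registers the tool it needs.
add_demonstration(task_library, "lookup",
                  "Q: Capital of France? Plan: [search(capital of France)] -> Paris. A: Paris")
register_tool(tool_library, "search",
              lambda q: "Paris" if "France" in q else "(no result)")
```

Because the model itself stays frozen, both kinds of feedback take effect immediately on the next query – no retraining step is involved.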

Benchmark Results

ART demonstrates substantial improvements across major benchmarks:4)

Comparison                                | Improvement
ART vs. few-shot prompting (unseen tasks) | +10.8% average
Tool-use contribution (additional)        | +12.3 percentage points
ART vs. hand-crafted CoT                  | Matches on a majority of tasks
ART + human feedback vs. hand-crafted CoT | Exceeds performance

Evaluated on the BIG-Bench and MMLU benchmarks, ART performs particularly well on arithmetic and algorithmic reasoning tasks.

Comparison to Other Methods

Method                   | Approach                        | ART Advantage
Few-shot prompting       | Fixed examples, no tools        | +10.8% from automated decomposition + tools
Standard CoT             | Manual reasoning chains         | Automated, no per-task engineering
Hand-crafted CoT + tools | Manual scripting of tool calls  | Fully automated tool selection and integration
PAL                      | Code generation for computation | Broader tool set beyond code execution

The key advantage is that ART automates both reasoning decomposition and tool selection without requiring manual crafting for each new task.

Limitations

See Also

References

1)
Paranjape et al. 2023
2)
Wei et al. 2022; Schick et al. 2023
3)
Paranjape et al. 2023, Section 3
4)
Paranjape et al. 2023, experimental results