Automatic Reasoning and Tool-Use (ART)

Automatic Reasoning and Tool-use (ART) is a framework that enables frozen large language models to automatically generate multi-step reasoning programs and integrate external tool outputs without requiring fine-tuning or hand-crafted task-specific demonstrations. ART automates both chain-of-thought decomposition and tool selection using a task library approach.1)

Motivation

Traditional chain-of-thought prompting and tool-use approaches rely on carefully hand-crafted, task-specific demonstrations and manually scripted interleaving between model generations and tool calls.2) This manual effort limits scalability and requires expertise for each new task. ART automates this entire process while keeping the underlying LLM frozen.

How It Works

ART operates in two primary phases:

Selection Phase

Given a new task, ART selects relevant demonstrations of multi-step reasoning and tool use from a task library – a structured repository containing examples of how to decompose and solve various task types with appropriate tools.
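A minimal sketch of this selection step, assuming a crude word-overlap similarity measure (the library entries, scoring function, and names below are illustrative assumptions, not the paper's implementation):

```python
def similarity(task_desc, demo_desc):
    """Crude word-overlap (Jaccard) similarity between two task descriptions."""
    a, b = set(task_desc.lower().split()), set(demo_desc.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def select_demonstrations(new_task, task_library, k=2):
    """Pick the k library entries whose descriptions best match the new task."""
    ranked = sorted(task_library,
                    key=lambda d: similarity(new_task, d["description"]),
                    reverse=True)
    return ranked[:k]

# Toy task library: each entry describes a task type and holds a demonstration.
task_library = [
    {"description": "solve arithmetic word problems with a calculator tool", "demo": "..."},
    {"description": "answer factual questions using a search tool", "demo": "..."},
    {"description": "transform strings by running python code", "demo": "..."},
]

picked = select_demonstrations("multi-step arithmetic word problems", task_library)
```

In practice a learned or embedding-based similarity would replace the word-overlap heuristic, but the shape of the step is the same: rank the library against the new task and keep the top matches.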

Execution Phase

At test time, ART dynamically generates reasoning steps as a program. When external tools are needed, the generation:

  1. Pauses at the tool invocation point.
  2. Executes the selected tool with appropriate inputs.
  3. Integrates the tool output into the reasoning chain.
  4. Resumes generation with the augmented context.

This seamless interleaving of reasoning and tool use happens automatically without manual scripting.
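The pause-execute-resume loop above can be sketched in Python. The tool-call marker syntax, `run_program`, and the toy `calc` tool are illustrative assumptions, not the paper's actual program format:

```python
import re

# Toy tool library: a calculator that evaluates arithmetic expressions.
TOOLS = {"calc": lambda expr: str(eval(expr, {"__builtins__": {}}))}

# Marker the (stubbed) LLM emits when it wants a tool, e.g. [calc(3 * 7)].
TOOL_CALL = re.compile(r"\[(\w+)\((.*?)\)\]")

def run_program(prompt, generate, tools, max_steps=8):
    """Interleave frozen-LLM generation with tool execution.

    `generate(context)` returns the next chunk of the reasoning program;
    generation pauses at each tool call, the tool runs, and its output is
    spliced into the context before generation resumes.
    """
    context = prompt
    for _ in range(max_steps):
        chunk = generate(context)
        match = TOOL_CALL.search(chunk)
        if match is None:                 # no tool needed: program is complete
            return context + chunk
        name, arg = match.group(1), match.group(2)
        output = tools[name](arg)         # pause and execute the selected tool
        # keep the text up to the call, append the tool output, then resume
        context += chunk[:match.end()] + f" -> {output}"
    return context

# Scripted stand-in for a frozen LLM, emitting one chunk per call.
script = iter(["I multiply the numbers: [calc(3 * 7)]", "\nSo the answer is 21."])
result = run_program("Q: What is 3 * 7?\n", lambda ctx: next(script), TOOLS)
```

The four numbered steps map directly onto the loop body: the regex match is the pause point, the `tools[name](arg)` call is the execution, the string splice integrates the output, and the next loop iteration resumes generation on the augmented context.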

Task Library and Tool Library

ART's extensibility depends on two key components:3)

  1. Task library – demonstrations of multi-step reasoning programs, grouped by task type (e.g. arithmetic, search, string manipulation).
  2. Tool library – external tools, such as code execution and search, that the model can invoke during generation.

The framework uses a selection mechanism to identify the most appropriate demonstrations and tools for each new task, enabling generalization to unseen tasks without hand-crafted, task-specific demonstrations.

Frozen LLM Approach

A distinguishing feature of ART is its use of frozen LLMs – the underlying language model requires no fine-tuning or parameter updates. This offers several advantages:

  1. No training cost – new tasks and tools are supported purely through prompting.
  2. Model portability – the task and tool libraries can be reused with newer or different LLMs.
  3. Easy maintenance – performance improves by extending the libraries rather than retraining the model.

Human-in-the-Loop

ART is designed to support human feedback. Humans can improve performance by:

  1. Correcting errors in generated reasoning programs and adding the corrected examples to the task library.
  2. Registering new tools in the tool library to extend ART's capabilities.

This makes ART an extensible system that improves with minimal human intervention.
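As an illustration, extending the libraries might look like the following sketch; the data structures and helper names are hypothetical, not ART's actual API:

```python
# Hypothetical library structures; ART's real formats differ in detail.
task_library = {
    "arithmetic": ["Q: 12 * 7? Plan: [calc(12 * 7)] -> 84. A: 84"],
}
tool_library = {"calc": lambda expr: str(eval(expr, {"__builtins__": {}}))}

def add_demonstration(library, task_type, demo):
    """Human feedback: file a corrected demonstration under a task type."""
    library.setdefault(task_type, []).append(demo)

def register_tool(tools, name, fn):
    """Human feedback: make a new tool available to future reasoning programs."""
    tools[name] = fn

# A human fixes a failure case and registers the tool it needs.
add_demonstration(task_library, "lookup",
                  "Q: Capital of France? Plan: [search(capital of France)] -> Paris. A: Paris")
register_tool(tool_library, "search",
              lambda q: "Paris" if "France" in q else "(no result)")
```

Because the model itself stays frozen, both kinds of feedback take effect immediately on the next query – no retraining step is involved.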

Benchmark Results

ART demonstrates substantial improvements across major benchmarks:4)

Comparison                                | Improvement
ART vs. few-shot prompting (unseen tasks) | +10.8% average
Tool-use contribution (additional)        | +12.3 percentage points
ART vs. hand-crafted CoT                  | Matches on a majority of tasks
ART + human feedback vs. hand-crafted CoT | Exceeds performance

Evaluated on the BIG-Bench and MMLU benchmarks, ART performs particularly well on arithmetic and algorithmic reasoning tasks.

Comparison to Other Methods

Method                   | Approach                        | ART Advantage
Few-shot prompting       | Fixed examples, no tools        | +10.8% from automated decomposition + tools
Standard CoT             | Manual reasoning chains         | Automated, no per-task engineering
Hand-crafted CoT + tools | Manual scripting of tool calls  | Fully automated tool selection and integration
PAL                      | Code generation for computation | Broader tool set beyond code execution

The key advantage is that ART automates both reasoning decomposition and tool selection without requiring manual crafting for each new task.

Limitations

See Also

References

1)
Paranjape et al. 2023
2)
Wei et al. 2022; Schick et al. 2023
3)
Paranjape et al. 2023, Section 3
4)
Paranjape et al. 2023, experimental results