====== Automatic Reasoning and Tool-Use (ART) ======

Automatic Reasoning and Tool-use (ART) is a framework that enables frozen large language models to automatically generate multi-step reasoning programs and integrate external tool outputs, without fine-tuning or hand-crafted task-specific demonstrations. ART automates both chain-of-thought decomposition and tool selection using a task library approach.((Paranjape et al. 2023, [[https://arxiv.org/abs/2303.09014|ART: Automatic multi-step reasoning and tool-use for large language models]]))

===== Motivation =====

Traditional chain-of-thought prompting and tool-use approaches rely on carefully hand-crafted, task-specific demonstrations and manually scripted interleaving of model generations and tool calls.((Wei et al. 2022; Schick et al. 2023)) This manual effort limits scalability and requires expertise for each new task. ART automates the entire process while keeping the underlying LLM frozen.

===== How It Works =====

ART operates in two primary phases.

==== Selection Phase ====

Given a new task, ART selects relevant demonstrations of multi-step reasoning and tool use from a **task library** -- a structured repository containing examples of how to decompose and solve various task types with appropriate tools.

==== Execution Phase ====

At test time, ART dynamically generates reasoning steps as a program. When an external tool is needed, generation:

  - **Pauses** at the tool invocation point.
  - **Executes** the selected tool with appropriate inputs.
  - **Integrates** the tool output into the reasoning chain.
  - **Resumes** generation with the augmented context.

This interleaving of reasoning and tool use happens automatically, without manual scripting.

===== Task Library and Tool Library =====

ART's extensibility depends on two key components:((Paranjape et al. 2023, Section 3))

  * **Task library**: Contains demonstrations of multi-step reasoning and tool-use patterns that the model can apply to new tasks through in-context learning.
  * **Tool library**: Maintains a collection of available external tools (e.g., search engines, calculators, code interpreters) that can be invoked during reasoning.

The framework uses a selection mechanism to identify the most appropriate demonstrations and tools for each new task, enabling generalization to unseen tasks without task-specific engineering.

===== Frozen LLM Approach =====

A distinguishing feature of ART is its use of **frozen LLMs** -- the underlying language model requires no fine-tuning or parameter updates. This offers several advantages:

  * The same pre-trained model works across diverse tasks.
  * There is no computational cost of retraining.
  * The framework leverages existing few-shot learning capabilities through carefully selected in-context demonstrations.

===== Human-in-the-Loop =====

ART is designed to support human feedback. Humans can improve performance by:

  * Correcting errors in task-specific reasoning programs.
  * Adding new tools to the tool library.
  * Updating task demonstrations in the task library.

This makes ART an extensible system whose performance improves with modest human effort.

===== Benchmark Results =====

ART demonstrates substantial improvements across major benchmarks:((Paranjape et al. 2023, experimental results))

| **Comparison** | **Improvement** |
| ART vs. few-shot prompting (unseen tasks) | +10.8% average |
| Tool-use contribution (additional) | +12.3 percentage points |
| ART vs. hand-crafted CoT | Matches on a majority of tasks |
| ART + human feedback vs. hand-crafted CoT | Exceeds performance |

Evaluated on BigBench and MMLU benchmarks, ART excels particularly on arithmetic and algorithmic reasoning tasks.
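The pause-execute-integrate-resume loop from the Execution Phase above can be sketched as a small controller around a frozen model. This is a minimal, self-contained illustration: the ''[TOOL]...[/TOOL]'' markers, the toy tool library, and the ''generate'' stub are assumptions made for the sketch, not the paper's actual prompt format or API.

```python
import re

# Hypothetical tool library: name -> callable. These toy tools stand in for
# real search engines, calculators, or code interpreters.
TOOL_LIBRARY = {
    "calculator": lambda expr: str(eval(expr)),          # toy calculator (sketch only)
    "search": lambda query: f"<results for {query!r}>",  # stub search engine
}

def generate(context):
    """Stand-in for a frozen LLM. A real system would call a model here;
    this stub emits a fixed two-step reasoning program so the loop runs."""
    if "[OUTPUT]" not in context:
        return context + "\nStep 1: Compute the sum. [TOOL]calculator(2+3)[/TOOL]"
    return context + "\nStep 2: The answer is 5. [DONE]"

def run_art(task_prompt, max_steps=10):
    """Interleave frozen-LLM generation with tool execution: pause at a
    tool marker, run the tool, splice its output back into the context,
    then resume generation with the augmented context."""
    context = task_prompt
    for _ in range(max_steps):
        context = generate(context)                       # resume generation
        if "[DONE]" in context:
            return context
        calls = list(re.finditer(r"\[TOOL\](\w+)\((.*?)\)\[/TOOL\]", context))
        if calls:                                         # pause: tool requested
            name, arg = calls[-1].groups()                # handle latest request
            result = TOOL_LIBRARY[name](arg)              # execute the tool
            context += f"\n[OUTPUT]{result}[/OUTPUT]"     # integrate the output
    return context

print(run_art("Task: what is 2 + 3?"))
```

Because the LLM stays frozen, all of the control logic lives in this outer loop: the model only ever sees plain text, and tool results enter the reasoning chain as ordinary context.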
===== Comparison to Other Methods =====

| **Method** | **Approach** | **ART Advantage** |
| Few-shot prompting | Fixed examples, no tools | +10.8% from automated decomposition + tools |
| Standard CoT | Manual reasoning chains | Automated, no per-task engineering |
| Hand-crafted CoT + tools | Manual scripting of tool calls | Fully automated tool selection and integration |
| PAL | Code generation for computation | Broader tool set beyond code execution |

The key advantage is that ART automates both reasoning decomposition and tool selection, without manual crafting for each new task.

===== Limitations =====

  * **Task library coverage**: Performance depends on having relevant demonstrations in the task library for new task types.
  * **Tool library scope**: Limited by the set of available tools; novel tool types require manual addition.
  * **Selection quality**: Poor demonstration selection can lead to suboptimal reasoning strategies.
  * **Frozen model constraints**: Cannot adapt the LLM itself to better use tools or reason about novel domains.

===== See Also =====

  * [[prompt_engineering]]
  * [[chain_of_thought_prompting]]
  * [[program_aided_language_models]]
  * [[automatic_prompt_engineer]]
  * [[few_shot_prompting]]
  * [[zero_shot_prompting]]

===== References =====