====== Automatic Reasoning and Tool-Use (ART) ======

Automatic Reasoning and Tool-use (ART) is a framework that enables frozen large language models to automatically generate multi-step reasoning programs and integrate external tool outputs, without fine-tuning or hand-crafted task-specific demonstrations. ART automates both chain-of-thought decomposition and tool selection using a task library approach.((Paranjape et al. 2023, [[https://arxiv.org/abs/2303.09014|ART: Automatic multi-step reasoning and tool-use for large language models]]))

===== Motivation =====

Traditional chain-of-thought prompting and tool-use approaches rely on carefully hand-crafted, task-specific demonstrations and manually scripted interleaving of model generations and tool calls.((Wei et al. 2022; Schick et al. 2023)) This manual effort limits scalability and requires expertise for each new task. ART automates the entire process while keeping the underlying LLM frozen.

===== How It Works =====

ART operates in two primary phases.

==== Selection Phase ====

Given a new task, ART selects relevant demonstrations of multi-step reasoning and tool use from a **task library** -- a structured repository containing examples of how to decompose and solve various task types with appropriate tools.

==== Execution Phase ====

At test time, ART dynamically generates reasoning steps as a program. When an external tool is needed, generation:

  - **Pauses** at the tool invocation point.
  - **Executes** the selected tool with appropriate inputs.
  - **Integrates** the tool output into the reasoning chain.
  - **Resumes** generation with the augmented context.

This interleaving of reasoning and tool use happens automatically, without manual scripting.

===== Task Library and Tool Library =====

ART's extensibility depends on two key components:((Paranjape et al. 2023, Section 3))

  * **Task library**: Contains demonstrations of multi-step reasoning and tool-use patterns that the model can apply to new tasks through in-context learning.
  * **Tool library**: Maintains a collection of available external tools (e.g., search engines, calculators, code interpreters) that can be invoked during reasoning.

The framework uses a selection mechanism to identify the most appropriate demonstrations and tools for each new task, enabling generalization to unseen tasks without task-specific engineering.

===== Frozen LLM Approach =====

A distinguishing feature of ART is its use of **frozen LLMs** -- the underlying language model requires no fine-tuning or parameter updates. This offers several advantages:

  * The same pre-trained model works across diverse tasks.
  * There is no computational cost of retraining.
  * The framework leverages existing few-shot learning capabilities through carefully selected in-context demonstrations.

===== Human-in-the-Loop =====

ART is designed to support human feedback. Humans can improve performance by:

  * Correcting errors in task-specific reasoning programs.
  * Adding new tools to the tool library.
  * Updating task demonstrations in the task library.

This makes ART an extensible system whose performance improves with modest human effort.

===== Benchmark Results =====

ART demonstrates substantial improvements across major benchmarks:((Paranjape et al. 2023, experimental results))

| **Comparison** | **Improvement** |
| ART vs. few-shot prompting (unseen tasks) | +10.8% average |
| Tool-use contribution (additional) | +12.3 percentage points |
| ART vs. hand-crafted CoT | Matches on a majority of tasks |
| ART + human feedback vs. hand-crafted CoT | Exceeds performance |

Evaluated on BigBench and MMLU benchmarks, ART excels particularly on arithmetic and algorithmic reasoning tasks.
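The pause-execute-integrate-resume loop from the Execution Phase above can be sketched as a small controller around a frozen model. This is a minimal, self-contained illustration: the ''[TOOL]...[/TOOL]'' markers, the toy tool library, and the ''generate'' stub are assumptions made for the sketch, not the paper's actual prompt format or API.

```python
import re

# Hypothetical tool library: name -> callable. These toy tools stand in for
# real search engines, calculators, or code interpreters.
TOOL_LIBRARY = {
    "calculator": lambda expr: str(eval(expr)),          # toy calculator (sketch only)
    "search": lambda query: f"<results for {query!r}>",  # stub search engine
}

def generate(context):
    """Stand-in for a frozen LLM. A real system would call a model here;
    this stub emits a fixed two-step reasoning program so the loop runs."""
    if "[OUTPUT]" not in context:
        return context + "\nStep 1: Compute the sum. [TOOL]calculator(2+3)[/TOOL]"
    return context + "\nStep 2: The answer is 5. [DONE]"

def run_art(task_prompt, max_steps=10):
    """Interleave frozen-LLM generation with tool execution: pause at a
    tool marker, run the tool, splice its output back into the context,
    then resume generation with the augmented context."""
    context = task_prompt
    for _ in range(max_steps):
        context = generate(context)                       # resume generation
        if "[DONE]" in context:
            return context
        calls = list(re.finditer(r"\[TOOL\](\w+)\((.*?)\)\[/TOOL\]", context))
        if calls:                                         # pause: tool requested
            name, arg = calls[-1].groups()                # handle latest request
            result = TOOL_LIBRARY[name](arg)              # execute the tool
            context += f"\n[OUTPUT]{result}[/OUTPUT]"     # integrate the output
    return context

print(run_art("Task: what is 2 + 3?"))
```

Because the LLM stays frozen, all of the control logic lives in this outer loop: the model only ever sees plain text, and tool results enter the reasoning chain as ordinary context.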
===== Comparison to Other Methods =====

| **Method** | **Approach** | **ART Advantage** |
| Few-shot prompting | Fixed examples, no tools | +10.8% from automated decomposition + tools |
| Standard CoT | Manual reasoning chains | Automated, no per-task engineering |
| Hand-crafted CoT + tools | Manual scripting of tool calls | Fully automated tool selection and integration |
| PAL | Code generation for computation | Broader tool set beyond code execution |

The key advantage is that ART automates both reasoning decomposition and tool selection, without manual crafting for each new task.

===== Limitations =====

  * **Task library coverage**: Performance depends on having relevant demonstrations in the task library for new task types.
  * **Tool library scope**: Limited by the set of available tools; novel tool types require manual addition.
  * **Selection quality**: Poor demonstration selection can lead to suboptimal reasoning strategies.
  * **Frozen model constraints**: Cannot adapt the LLM itself to better use tools or reason about novel domains.

===== See Also =====

  * [[prompt_engineering]]
  * [[chain_of_thought_prompting]]
  * [[program_aided_language_models]]
  * [[automatic_prompt_engineer]]
  * [[few_shot_prompting]]
  * [[zero_shot_prompting]]

===== References =====