
Tool Learning with Foundation Models

Tool Learning with Foundation Models is a comprehensive survey by Qin et al. (2023) that formalizes how large language models can serve as intelligent controllers that leverage external tools to overcome their inherent limitations. The survey establishes a unified framework covering tool creation, selection, invocation, and evaluation, drawing on cognitive science to ground the paradigm in human tool-use evolution.

Overview

Foundation models excel at language understanding and generation but struggle with precise computation, real-time data access, and physical interaction. Tool learning addresses these gaps by positioning the LLM as an orchestrator that decomposes tasks and delegates specialized operations to external tools. This mirrors how human intelligence evolved to extend biological capabilities through tool creation and use.

Cognitive Origins

The survey grounds tool learning in cognitive science, tracing tool use from early hominid stone tools (~3.3 million years ago) to modern computational tools. Key cognitive pillars that foundation models emulate include:

  • Planning and Reasoning: Hierarchical decomposition of goals into sub-goals and actions
  • Dynamic Adjustment: Feedback-driven adaptation, analogous to trial-and-error learning in primates
  • Generalization: Transfer of tool-use skills across contexts via abstract representation

Foundation models replicate these capabilities through emergent abilities like in-context learning and chain-of-thought reasoning.

The Framework

The framework comprises five interacting components:

  1. Controller: The foundation model that interprets instructions, plans, and orchestrates tool use
  2. Tool Set: Available external tools organized by type
  3. Environment: The context in which tools operate and produce effects
  4. Perceiver: Modules that convert environment state into language feedback
  5. Human: Provides instructions, feedback, and oversight
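The interaction among these five components can be sketched as a minimal control loop. This is an illustrative sketch, not an implementation from the survey; all class and function names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    run: Callable[[str], str]  # executes an operation in the environment

@dataclass
class Framework:
    # Controller: picks (tool name, tool input) from an instruction
    controller: Callable[[str, list], tuple]
    # Tool set: available external tools, keyed by name
    tool_set: dict
    # Perceiver: converts raw environment output into language feedback
    perceiver: Callable[[str], str] = lambda raw: f"Observation: {raw}"

    def step(self, instruction: str) -> str:
        # Human provides the instruction; controller plans a tool call
        tool_name, tool_input = self.controller(instruction, list(self.tool_set))
        # Tool acts on the environment; perceiver verbalizes the effect
        raw = self.tool_set[tool_name].run(tool_input)
        return self.perceiver(raw)

# Toy wiring: route every instruction to a calculator tool
# (eval is used only because this is a toy; it is unsafe on untrusted input)
fw = Framework(
    controller=lambda inst, tools: ("calc", inst),
    tool_set={"calc": Tool("calc", lambda expr: str(eval(expr)))},
)
print(fw.step("2 + 3"))  # Observation: 5
```

A real system would replace the lambda controller with an LLM call and loop `step` until the task is complete.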

Tool Use Pipeline

The core pipeline formalizes four stages:

  1. Tool Creation: Development of modular, specialized functions (APIs, scripts, models) for tasks beyond LLM capabilities. An emerging frontier is LLM-driven dynamic tool creation via code generation.
  2. Tool Selection: Given user instruction $I$, the controller decomposes it into sub-tasks and selects an optimal tool subset $T^* \subseteq T$ from the available set $T$:

$$T^* = \arg\max_{T' \subseteq T} \text{Utility}(T', I)$$

  3. Tool Invocation: The controller generates structured calls (e.g., JSON arguments) to execute selected tools, incorporating outputs back into context for iterative refinement.
  4. Tool Evaluation: Post-invocation assessment of outputs via self-reflection or external feedback, with replanning if results are unsatisfactory.
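The selection objective above can be approximated by scoring each tool's description against the instruction. The sketch below uses token overlap as a crude stand-in for the utility function (real systems typically use embedding similarity or LLM ranking) and a greedy top-k selection instead of a true search over subsets:

```python
def utility(tool_desc: str, instruction: str) -> float:
    # Crude proxy for Utility(T', I): fraction of instruction tokens
    # that also appear in the tool's description
    inst = set(instruction.lower().split())
    desc = set(tool_desc.lower().split())
    return len(inst & desc) / max(len(inst), 1)

def select_tools(tools: dict, instruction: str, k: int = 1) -> list:
    # Greedy approximation of the argmax over tool subsets:
    # rank tools by utility and keep the top k
    ranked = sorted(tools, key=lambda name: utility(tools[name], instruction),
                    reverse=True)
    return ranked[:k]

tools = {
    "calculator": "evaluate a mathematical expression",
    "web_search": "search the web for current information",
}
print(select_tools(tools, "search the web for today's weather"))  # ['web_search']
```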

Taxonomy of Tool Types

Tools are categorized by their interaction modality:

Category    | Description                                      | Examples
Perception  | Convert raw data into structured representations | OCR, speech-to-text, image captioning
Action      | Execute operations via APIs or commands          | Web search, code interpreters, robot control
Computation | Perform numerical or symbolic calculations       | Calculators, Wolfram Alpha, simulators
Data        | Retrieve, store, or manage information           | Databases, knowledge graphs, vector stores

This taxonomy highlights the complementary relationship: tools handle precise low-level operations while models manage high-level orchestration.
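The taxonomy lends itself to a simple registry structure; a hypothetical sketch (tool names here are illustrative identifiers, not real APIs):

```python
# Hypothetical registry mapping the survey's four categories to example tools
TOOL_TAXONOMY = {
    "perception": ["ocr", "speech_to_text", "image_captioning"],
    "action": ["web_search", "code_interpreter", "robot_control"],
    "computation": ["calculator", "wolfram_alpha", "simulator"],
    "data": ["database", "knowledge_graph", "vector_store"],
}

def category_of(tool: str):
    # Reverse lookup: which interaction modality does a tool belong to?
    for category, tools in TOOL_TAXONOMY.items():
        if tool in tools:
            return category
    return None

print(category_of("calculator"))  # computation
```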

Code Example

A minimal sketch of the pipeline using the OpenAI chat-completions function-calling API. The search endpoint URL and its response schema below are placeholders, not a real service.

import json
import openai
import requests
 
# Define available tools with schemas
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "calculator",
            "description": "Evaluate a mathematical expression",
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {"type": "string", "description": "Math expression to evaluate"}
                },
                "required": ["expression"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "web_search",
            "description": "Search the web for current information",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"}
                },
                "required": ["query"]
            }
        }
    }
]
 
def execute_tool(name, args):
    if name == "calculator":
        # WARNING: eval on untrusted input is unsafe; use a sandboxed
        # expression parser in production
        return str(eval(args["expression"]))  # simplified
    elif name == "web_search":
        return requests.get(
            "https://api.search.example/v1/search",  # placeholder endpoint
            params={"q": args["query"]}
        ).json()["results"][0]["snippet"]
    raise ValueError(f"Unknown tool: {name}")
 
def tool_augmented_generation(query, client):
    messages = [{"role": "user", "content": query}]

    # Cap iterations so a misbehaving model cannot loop forever
    for _ in range(10):
        response = client.chat.completions.create(
            model="gpt-4",
            messages=messages,
            tools=TOOLS,
            tool_choice="auto"
        )
        msg = response.choices[0].message
        messages.append(msg)

        # No tool calls means the model has produced its final answer
        if not msg.tool_calls:
            return msg.content

        # Execute each requested tool and feed the result back as context
        for call in msg.tool_calls:
            result = execute_tool(call.function.name, json.loads(call.function.arguments))
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": str(result)
            })
    return msg.content  # give up after the iteration cap

LLMs as Tool Controllers

Foundation models are effective controllers due to several complementary strengths:

  • World knowledge for informed decision-making about which tools to apply
  • Planning capability over long task horizons
  • Natural language interface for interpreting tool descriptions and API documentation
  • Code generation for producing executable tool invocations

Key benefits of the tool-augmented approach include interpretability (tool calls expose reasoning), robustness (verifiable API outputs reduce hallucination), and efficiency (offloading compute-intensive sub-tasks).

Experimental Findings

The survey evaluates 18 representative tools across the taxonomy:

  • Zero/few-shot prompting achieves 80-90% accuracy on tool-enabled tasks vs. 20-50% without tools
  • Multi-tool chains (search, compute, verify) achieve near-perfect scores on complex math
  • Models struggle with dynamic selection of novel tool combinations (<50% accuracy)
  • GPT-4 demonstrates self-correction capabilities that validate the framework
