====== ToolLLM: Facilitating Large Language Models to Master 16,000+ Real-World APIs ======
**ToolLLM** is a general-purpose tool-use framework introduced by Qin et al. (2023)(([[https://arxiv.org/abs/2307.16789|Qin et al. "ToolLLM: Facilitating Large Language Models to Master 16,000+ Real-World APIs" (2023).]])) that enables open-source LLMs to leverage a massive collection of real-world APIs effectively. The paper has accumulated over **1,305 citations**, establishing it as a foundational work in LLM tool-use research that builds on earlier efforts such as Toolformer(([[https://arxiv.org/abs/2302.04761|Schick et al. "Toolformer: Language Models Can Teach Themselves to Use Tools" (2023).]])) and Large Language Models as Tool Makers(([[https://arxiv.org/abs/2305.17126|Cai et al. "Large Language Models as Tool Makers" (2023).]])). The framework addresses a critical gap between closed-source models (e.g., ChatGPT) and open-source alternatives by providing a comprehensive pipeline for data construction, training, and evaluation.
[[https://arxiv.org/abs/2307.16789|arXiv:2307.16789]]
===== Core Components =====
==== ToolBench Dataset ====
ToolBench is a large-scale instruction-tuning dataset containing **126,486 (instruction, solution path) pairs** covering **16,464 real-world RESTful APIs** spanning 49 categories from RapidAPI Hub(([[https://github.com/OpenBMB/ToolBench|ToolBench GitHub Repository]])). The dataset was automatically constructed using ChatGPT (gpt-3.5-turbo-16k) through a three-phase process:
* **Single-tool instructions**: Tasks solvable with a single API
* **Intra-category multi-tool instructions**: Tasks requiring multiple APIs from the same category
* **Intra-collection multi-tool instructions**: Tasks spanning APIs across different categories
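The sample-APIs-then-prompt step of this construction process can be sketched as follows. The ''build_instruction_prompt'' helper, the prompt wording, and the API documents below are hypothetical stand-ins for the paper's actual ChatGPT prompts:

<code python>
import random

def build_instruction_prompt(api_docs, mode="single", n_sample=3, seed=0):
    """Assemble a prompt asking an LLM to generate task instructions
    grounded in sampled API documentation (illustrative only).

    api_docs: list of dicts with 'name' and 'description' keys.
    mode: 'single', 'intra_category', or 'intra_collection'.
    """
    rng = random.Random(seed)
    k = 1 if mode == "single" else min(n_sample, len(api_docs))
    sampled = rng.sample(api_docs, k)
    listing = "\n".join(f"- {a['name']}: {a['description']}" for a in sampled)
    return (
        f"You are given the following API(s):\n{listing}\n"
        "Generate diverse user instructions that can only be solved "
        "by calling the API(s) above, and note which APIs each one uses."
    )

docs = [
    {"name": "weather.current", "description": "Current conditions by city"},
    {"name": "weather.forecast", "description": "5-day forecast by city"},
    {"name": "flights.search", "description": "Search flights by route"},
]
prompt = build_instruction_prompt(docs, mode="intra_category")
</code>

In the paper this prompting is done with gpt-3.5-turbo-16k over APIs sampled from RapidAPI Hub; the three modes above correspond to the three instruction types listed.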
==== DFSDT: Depth-First Search-based Decision Tree ====
DFSDT is a novel reasoning algorithm that expands the search space beyond linear chain-of-thought by evaluating multiple reasoning traces in a depth-first search manner.
The search process can be formalized as:
$$T^* = \arg\max_{T \in \mathcal{T}} R(T)$$
where $T$ is a reasoning trace in the search tree $\mathcal{T}$ and $R(T)$ is the reward (successful API execution). At each node, the LLM generates candidate actions, expands promising branches, and backtracks from failed paths.
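The argmax over traces can be illustrated with a toy depth-first enumeration. The ''tree'' and ''reward'' values below are invented for illustration; in ToolLLM the reward comes from successful API execution rather than a lookup table:

<code python>
def dfs_best(tree, node, reward, path=()):
    """Enumerate root-to-leaf reasoning traces T depth-first and return
    (argmax_T R(T), max reward), mirroring the formula above."""
    path = path + (node,)
    children = tree.get(node, [])
    if not children:  # leaf: a complete trace with reward R(T)
        return path, reward.get(node, 0)
    return max((dfs_best(tree, child, reward, path) for child in children),
               key=lambda pair: pair[1])

# Toy search tree: only the trace ending in "b1" executes successfully.
tree = {"root": ["a", "b"], "a": ["a1"], "b": ["b1"]}
reward = {"a1": 0, "b1": 1}
best_path, best_reward = dfs_best(tree, "root", reward)
</code>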
==== ToolEval Evaluation Framework ====
ToolEval is an automatic evaluation framework using two metrics:
* **Pass Rate**: Percentage of instructions successfully executed within budget constraints
* **Win Rate**: Pairwise comparison of solution path quality, judged by ChatGPT
Both metrics show high correlation with human evaluation.
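Once per-instruction outcomes and pairwise judgements are collected, both metrics reduce to simple ratios. A minimal sketch, noting that in ToolEval the win/lose labels come from a ChatGPT judge and that counting a tie as half a win is a common convention assumed here, not taken from the paper:

<code python>
def pass_rate(results):
    """Fraction of instructions solved within the budget."""
    if not results:
        return 0.0
    return sum(1 for r in results if r["solved"]) / len(results)

def win_rate(judgements):
    """Fraction of pairwise comparisons won by the candidate model;
    a tie counts as half a win (assumed convention)."""
    if not judgements:
        return 0.0
    points = {"win": 1.0, "tie": 0.5, "lose": 0.0}
    return sum(points[j] for j in judgements) / len(judgements)
</code>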
===== DFSDT Pseudocode =====
A simplified sketch of the ToolLLM inference loop with DFSDT. This is illustrative pseudocode: helpers such as ''Node'', ''execute_api_call'', ''update_state'', ''is_task_complete'', ''reconstruct_path'', and ''score'' are assumed, not taken from the paper's codebase.
<code python>
def dfsdt_search(instruction, api_retriever, llm, max_depth=5):
    # Retrieve the most relevant APIs for this instruction.
    apis = api_retriever.retrieve(instruction, top_k=5)
    root = Node(state=instruction, apis=apis, depth=0)
    stack = [root]
    best_path = None
    while stack:
        node = stack.pop()
        if node.depth >= max_depth:
            continue  # depth budget exhausted; backtrack
        # The LLM proposes candidate actions (API calls) at this node.
        actions = llm.generate_actions(node.state, node.apis)
        for action in actions:
            observation = execute_api_call(action)
            if observation.failed:
                continue  # prune the failed branch and backtrack
            child = Node(
                state=update_state(node.state, action, observation),
                apis=node.apis, parent=node, depth=node.depth + 1,
            )
            if is_task_complete(child.state):
                path = reconstruct_path(child)
                # Keep the highest-scoring complete solution path.
                if best_path is None or score(path) > score(best_path):
                    best_path = path
            else:
                stack.append(child)  # expand the promising branch depth-first
    return best_path
</code>
===== System Architecture =====
The end-to-end pipeline, in Mermaid notation:
<code>
graph TD
    A[User Instruction] --> B[API Retriever]
    B --> C[Relevant APIs]
    C --> D[DFSDT Reasoning Engine]
    D --> E{Generate Candidate Actions}
    E --> F[API Call Execution]
    F --> G{Task Complete?}
    G -- No --> H[Expand Search Tree]
    H --> E
    G -- Yes --> I[Return Best Solution Path]
    D --> J[Backtrack on Failure]
    J --> E
    K[ToolBench Training Data] --> L[Fine-tune LLaMA]
    L --> M[ToolLLaMA]
    M --> D
</code>
===== Key Results =====
* ToolLLaMA matches ChatGPT and approaches GPT-4 performance on ToolBench(([[https://arxiv.org/abs/2307.16789|Qin et al. "ToolLLM" (2023)]]))
* Outperforms Text-Davinci-003 and Claude-2 on tool-use tasks
* Strong zero-shot generalization to unseen APIs, tools, and categories
* Neural API retriever effectively selects relevant APIs from 16K+ candidates
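The retrieval step in the last point can be sketched with a toy similarity model. The paper trains a Sentence-BERT-based dense retriever; the bag-of-words ''embed'' below is merely a self-contained stand-in, and the example API documents are invented:

<code python>
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' standing in for a trained
    dense encoder (Sentence-BERT in the paper)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_apis(instruction, api_docs, top_k=3):
    """Rank API documents by similarity to the instruction, as the
    neural API retriever does over 16K+ candidates."""
    q = embed(instruction)
    ranked = sorted(api_docs,
                    key=lambda d: cosine(q, embed(d["description"])),
                    reverse=True)
    return ranked[:top_k]

docs = [
    {"name": "weather.current",
     "description": "current weather conditions for a city"},
    {"name": "flights.search",
     "description": "search flights between airports"},
]
top = retrieve_apis("what is the current weather in Paris", docs, top_k=1)
</code>

Swapping in a trained encoder only changes ''embed''; the ranking loop stays the same.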
===== See Also =====
* [[llm_tool_makers|LATM: Large Language Models as Tool Makers]]
* [[chemcrow|ChemCrow: LLM Agent with Chemistry Tools]]
* [[reasoning_via_planning|RAP: Reasoning via Planning]]
===== References =====