====== ToolLLM: Facilitating Large Language Models to Master 16,000+ Real-World APIs ======
**ToolLLM** is a general-purpose tool-use framework introduced by Qin et al. (2023)(([[https://arxiv.org/abs/2307.16789|Qin et al. "ToolLLM: Facilitating Large Language Models to Master 16,000+ Real-World APIs" (2023).]])) that enables open-source LLMs to leverage a massive collection of real-world APIs effectively. The paper has accumulated over **1,305 citations**, establishing it as a foundational work in LLM tool-use research that builds on earlier efforts such as Toolformer(([[https://arxiv.org/abs/2302.04761|Schick et al. "Toolformer: Language Models Can Teach Themselves to Use Tools" (2023).]])) and Large Language Models as Tool Makers(([[https://arxiv.org/abs/2305.17126|Cai et al. "Large Language Models as Tool Makers" (2023).]])). The framework addresses a critical gap between closed-source models (e.g., ChatGPT) and open-source alternatives by providing a comprehensive pipeline for data construction, training, and evaluation.
[[https://arxiv.org/abs/2307.16789|arXiv:2307.16789]]
===== Core Components =====
==== ToolBench Dataset ====
ToolBench is a large-scale instruction-tuning dataset containing **126,486 (instruction, solution path) pairs** covering **16,464 real-world RESTful APIs** spanning 49 categories from RapidAPI Hub(([[https://github.com/OpenBMB/ToolBench|ToolBench GitHub Repository]])). The dataset was automatically constructed using ChatGPT (gpt-3.5-turbo-16k) through a three-phase process:
* **Single-tool instructions**: Tasks solvable with a single API
* **Intra-category multi-tool instructions**: Tasks requiring multiple APIs from the same category
* **Intra-collection multi-tool instructions**: Tasks spanning APIs across different categories
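The sample-APIs-then-prompt step of this construction process can be sketched as follows. The ''build_instruction_prompt'' helper, the prompt wording, and the API documents below are hypothetical stand-ins for the paper's actual ChatGPT prompts:

<code python>
import random

def build_instruction_prompt(api_docs, mode="single", n_sample=3, seed=0):
    """Assemble a prompt asking an LLM to generate task instructions
    grounded in sampled API documentation (illustrative only).

    api_docs: list of dicts with 'name' and 'description' keys.
    mode: 'single', 'intra_category', or 'intra_collection'.
    """
    rng = random.Random(seed)
    k = 1 if mode == "single" else min(n_sample, len(api_docs))
    sampled = rng.sample(api_docs, k)
    listing = "\n".join(f"- {a['name']}: {a['description']}" for a in sampled)
    return (
        f"You are given the following API(s):\n{listing}\n"
        "Generate diverse user instructions that can only be solved "
        "by calling the API(s) above, and note which APIs each one uses."
    )

docs = [
    {"name": "weather.current", "description": "Current conditions by city"},
    {"name": "weather.forecast", "description": "5-day forecast by city"},
    {"name": "flights.search", "description": "Search flights by route"},
]
prompt = build_instruction_prompt(docs, mode="intra_category")
</code>

In the paper this prompting is done with gpt-3.5-turbo-16k over APIs sampled from RapidAPI Hub; the three modes above correspond to the three instruction types listed.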
==== DFSDT: Depth-First Search-based Decision Tree ====
DFSDT is a novel reasoning algorithm that expands the search space beyond linear chain-of-thought by evaluating multiple reasoning traces in a depth-first search manner.
The search process can be formalized as:
$$T^* = \arg\max_{T \in \mathcal{T}} R(T)$$
where $T$ is a reasoning trace in the search tree $\mathcal{T}$ and $R(T)$ is the reward (successful API execution). At each node, the LLM generates candidate actions, expands promising branches, and backtracks from failed paths.
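The argmax over traces can be illustrated with a toy depth-first enumeration. The ''tree'' and ''reward'' values below are invented for illustration; in ToolLLM the reward comes from successful API execution rather than a lookup table:

<code python>
def dfs_best(tree, node, reward, path=()):
    """Enumerate root-to-leaf reasoning traces T depth-first and return
    (argmax_T R(T), max reward), mirroring the formula above."""
    path = path + (node,)
    children = tree.get(node, [])
    if not children:  # leaf: a complete trace with reward R(T)
        return path, reward.get(node, 0)
    return max((dfs_best(tree, child, reward, path) for child in children),
               key=lambda pair: pair[1])

# Toy search tree: only the trace ending in "b1" executes successfully.
tree = {"root": ["a", "b"], "a": ["a1"], "b": ["b1"]}
reward = {"a1": 0, "b1": 1}
best_path, best_reward = dfs_best(tree, "root", reward)
</code>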
==== ToolEval Evaluation Framework ====
ToolEval is an automatic evaluation framework using two metrics:
* **Pass Rate**: Percentage of instructions successfully executed within budget constraints
* **Win Rate**: Pairwise comparison of solution path quality, judged by ChatGPT
Both metrics show high correlation with human evaluation.
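Once per-instruction outcomes and pairwise judgements are collected, both metrics reduce to simple ratios. A minimal sketch, noting that in ToolEval the win/lose labels come from a ChatGPT judge and that counting a tie as half a win is a common convention assumed here, not taken from the paper:

<code python>
def pass_rate(results):
    """Fraction of instructions solved within the budget."""
    if not results:
        return 0.0
    return sum(1 for r in results if r["solved"]) / len(results)

def win_rate(judgements):
    """Fraction of pairwise comparisons won by the candidate model;
    a tie counts as half a win (assumed convention)."""
    if not judgements:
        return 0.0
    points = {"win": 1.0, "tie": 0.5, "lose": 0.0}
    return sum(points[j] for j in judgements) / len(judgements)
</code>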
===== DFSDT Pseudocode =====
A simplified sketch of the ToolLLM inference loop with DFSDT. This is illustrative pseudocode: helpers such as ''Node'', ''execute_api_call'', ''update_state'', ''is_task_complete'', ''reconstruct_path'', and ''score'' are assumed, not taken from the paper's codebase.
<code python>
def dfsdt_search(instruction, api_retriever, llm, max_depth=5):
    # Retrieve the most relevant APIs for this instruction.
    apis = api_retriever.retrieve(instruction, top_k=5)
    root = Node(state=instruction, apis=apis, depth=0)
    stack = [root]
    best_path = None
    while stack:
        node = stack.pop()
        if node.depth >= max_depth:
            continue  # depth budget exhausted; backtrack
        # The LLM proposes candidate actions (API calls) at this node.
        actions = llm.generate_actions(node.state, node.apis)
        for action in actions:
            observation = execute_api_call(action)
            if observation.failed:
                continue  # prune the failed branch and backtrack
            child = Node(
                state=update_state(node.state, action, observation),
                apis=node.apis, parent=node, depth=node.depth + 1,
            )
            if is_task_complete(child.state):
                path = reconstruct_path(child)
                # Keep the highest-scoring complete solution path.
                if best_path is None or score(path) > score(best_path):
                    best_path = path
            else:
                stack.append(child)  # expand the promising branch depth-first
    return best_path
</code>
===== System Architecture =====
The end-to-end pipeline, in Mermaid notation:
<code>
graph TD
    A[User Instruction] --> B[API Retriever]
    B --> C[Relevant APIs]
    C --> D[DFSDT Reasoning Engine]
    D --> E{Generate Candidate Actions}
    E --> F[API Call Execution]
    F --> G{Task Complete?}
    G -- No --> H[Expand Search Tree]
    H --> E
    G -- Yes --> I[Return Best Solution Path]
    D --> J[Backtrack on Failure]
    J --> E
    K[ToolBench Training Data] --> L[Fine-tune LLaMA]
    L --> M[ToolLLaMA]
    M --> D
</code>
===== Key Results =====
* ToolLLaMA matches ChatGPT and approaches GPT-4 performance on ToolBench(([[https://arxiv.org/abs/2307.16789|Qin et al. "ToolLLM" (2023)]]))
* Outperforms Text-Davinci-003 and Claude-2 on tool-use tasks
* Strong zero-shot generalization to unseen APIs, tools, and categories
* Neural API retriever effectively selects relevant APIs from 16K+ candidates
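The retrieval step in the last point can be sketched with a toy similarity model. The paper trains a Sentence-BERT-based dense retriever; the bag-of-words ''embed'' below is merely a self-contained stand-in, and the example API documents are invented:

<code python>
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' standing in for a trained
    dense encoder (Sentence-BERT in the paper)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_apis(instruction, api_docs, top_k=3):
    """Rank API documents by similarity to the instruction, as the
    neural API retriever does over 16K+ candidates."""
    q = embed(instruction)
    ranked = sorted(api_docs,
                    key=lambda d: cosine(q, embed(d["description"])),
                    reverse=True)
    return ranked[:top_k]

docs = [
    {"name": "weather.current",
     "description": "current weather conditions for a city"},
    {"name": "flights.search",
     "description": "search flights between airports"},
]
top = retrieve_apis("what is the current weather in Paris", docs, top_k=1)
</code>

Swapping in a trained encoder only changes ''embed''; the ranking loop stays the same.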
===== See Also =====
* [[llm_tool_makers|LATM: Large Language Models as Tool Makers]]
* [[chemcrow|ChemCrow: LLM Agent with Chemistry Tools]]
* [[reasoning_via_planning|RAP: Reasoning via Planning]]
===== References =====