====== ToolLLM: Facilitating Large Language Models to Master 16,000+ Real-World APIs ======

**ToolLLM** is a general-purpose tool-use framework introduced by Qin et al. (2023)(([[https://arxiv.org/abs/2307.16789|Qin et al. "ToolLLM: Facilitating Large Language Models to Master 16,000+ Real-World APIs" (2023)]])) that enables open-source LLMs to effectively leverage a massive collection of real-world APIs. The paper has accumulated more than **1,300 citations**, establishing it as a foundational work in LLM tool-use research, building on earlier work such as Toolformer(([[https://arxiv.org/abs/2302.04761|Schick et al. "Toolformer: Language Models Can Teach Themselves to Use Tools" (2023)]])) and Large Language Models as Tool Makers(([[https://arxiv.org/abs/2305.17126|Cai et al. "Large Language Models as Tool Makers" (2023)]])). The framework addresses a critical gap between closed-source models (e.g., ChatGPT) and open-source alternatives by providing a comprehensive data construction, training, and evaluation pipeline.

[[https://arxiv.org/abs/2307.16789|arXiv:2307.16789]]

===== Core Components =====

==== ToolBench Dataset ====

ToolBench is a large-scale instruction-tuning dataset containing **126,486 (instruction, solution path) pairs** covering **16,464 real-world RESTful APIs** spanning 49 categories from RapidAPI Hub(([[https://github.com/OpenBMB/ToolBench|ToolBench GitHub Repository]])).
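Each ToolBench entry pairs a natural-language instruction with a multi-step solution path of API calls. The exact schema lives in the ToolBench repository; the sketch below is purely illustrative, and every field name in it is an assumption rather than the repository's real format:

```python
# Illustrative sketch of one (instruction, solution path) pair.
# All field names here are hypothetical; see the ToolBench repository
# for the actual data format.
example_pair = {
    "instruction": "Find the current weather in Paris and convert 20 EUR to USD.",
    "relevant_apis": [
        {"category": "Weather", "tool": "weather_api", "api": "get_current_weather"},
        {"category": "Finance", "tool": "currency_api", "api": "convert"},
    ],
    "solution_path": [
        {"thought": "First look up the weather.",
         "action": "get_current_weather", "arguments": {"city": "Paris"}},
        {"thought": "Now convert the currency.",
         "action": "convert", "arguments": {"amount": 20, "from": "EUR", "to": "USD"}},
        {"thought": "Both results obtained; give the final answer.",
         "action": "Finish", "arguments": {"final_answer": "(summary of results)"}},
    ],
}

# A multi-tool instruction touches APIs from more than one category.
categories = {api["category"] for api in example_pair["relevant_apis"]}
print(len(example_pair["solution_path"]), sorted(categories))
```

A pair like this is "intra-collection multi-tool" in the paper's taxonomy because its APIs span two categories.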
The dataset was automatically constructed using ChatGPT (gpt-3.5-turbo-16k) through a three-phase process:

  * **Single-tool instructions**: Tasks solvable with a single API
  * **Intra-category multi-tool instructions**: Tasks requiring multiple APIs from the same category
  * **Intra-collection multi-tool instructions**: Tasks spanning APIs across different categories within the same collection

==== DFSDT: Depth-First Search-based Decision Tree ====

DFSDT is a reasoning algorithm that expands the search space beyond linear chain-of-thought by evaluating multiple reasoning traces in a depth-first manner. The search can be formalized as:

$$T^* = \arg\max_{T \in \mathcal{T}} R(T)$$

where $T$ is a reasoning trace in the search tree $\mathcal{T}$ and $R(T)$ is its reward (successful API execution). At each node, the LLM generates candidate actions, expands promising branches, and backtracks from failed paths.

==== ToolEval Evaluation Framework ====

ToolEval is an automatic evaluation framework built on two metrics:

  * **Pass Rate**: Percentage of instructions successfully executed within budget constraints
  * **Win Rate**: Pairwise comparison of solution-path quality, judged by ChatGPT

Both metrics correlate strongly with human evaluation.
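Once per-instruction judgments have been collected, both ToolEval metrics reduce to simple aggregate statistics. The sketch below shows only that aggregation step; the ChatGPT-based judging that produces the inputs is not implemented, and counting ties as half a win is an assumption of this sketch rather than a documented ToolEval rule:

```python
def pass_rate(results):
    """Fraction of instructions solved within budget; results is a list of booleans."""
    return sum(results) / len(results)

def win_rate(judgments):
    """Fraction of pairwise comparisons won; a tie counts as half a win (assumption)."""
    score = {"win": 1.0, "tie": 0.5, "loss": 0.0}
    return sum(score[j] for j in judgments) / len(judgments)

print(pass_rate([True, True, False, True]))    # 0.75
print(win_rate(["win", "tie", "loss", "win"]))  # 0.625
```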
===== Architecture =====

The DFSDT inference loop can be sketched in pseudocode (the ''Node'' class and helpers such as ''execute_api_call'', ''is_task_complete'', and ''score'' are assumed):

<code python>
# Simplified ToolLLM inference loop with DFSDT (pseudocode)
def dfsdt_search(instruction, api_retriever, llm, max_depth=5):
    apis = api_retriever.retrieve(instruction, top_k=5)
    root = Node(state=instruction, apis=apis, depth=0)
    stack = [root]
    best_path = None
    while stack:
        node = stack.pop()
        if node.depth >= max_depth:
            continue  # depth budget exhausted: backtrack
        # LLM generates candidate actions (API calls) for this state
        actions = llm.generate_actions(node.state, node.apis)
        for action in actions:
            observation = execute_api_call(action)
            child = Node(
                state=update_state(node.state, action, observation),
                apis=node.apis,
                parent=node,
                depth=node.depth + 1,
            )
            if is_task_complete(child.state):
                path = reconstruct_path(child)
                if best_path is None or score(path) > score(best_path):
                    best_path = path
            else:
                stack.append(child)  # expand this branch depth-first
    return best_path
</code>

===== System Architecture =====

The end-to-end pipeline, in Mermaid notation:

<code>
graph TD
    A[User Instruction] --> B[API Retriever]
    B --> C[Relevant APIs]
    C --> D[DFSDT Reasoning Engine]
    D --> E{Generate Candidate Actions}
    E --> F[API Call Execution]
    F --> G{Task Complete?}
    G -- No --> H[Expand Search Tree]
    H --> E
    G -- Yes --> I[Return Best Solution Path]
    D --> J[Backtrack on Failure]
    J --> E
    K[ToolBench Training Data] --> L[Fine-tune LLaMA]
    L --> M[ToolLLaMA]
    M --> D
</code>

===== Key Results =====

  * ToolLLaMA matches ChatGPT and approaches GPT-4 performance on ToolBench(([[https://arxiv.org/abs/2307.16789|Qin et al. "ToolLLM" (2023)]]))
  * Outperforms Text-Davinci-003 and Claude-2 on tool-use tasks
  * Strong zero-shot generalization to unseen APIs, tools, and categories
  * The neural API retriever effectively selects relevant APIs from 16K+ candidates

===== See Also =====

  * [[llm_tool_makers|LATM: Large Language Models as Tool Makers]]
  * [[chemcrow|ChemCrow: LLM Agent with Chemistry Tools]]
  * [[reasoning_via_planning|RAP: Reasoning via Planning]]

===== References =====