Core Concepts
Reasoning
Memory & Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools
Safety & Security
Evaluation
Meta
ToolLLM is a general-purpose tool-use framework introduced by Qin et al. (2023)1) that enables open-source LLMs to effectively leverage a massive collection of real-world APIs. The paper has accumulated more than 1,300 citations, establishing it as a foundational work in LLM tool-use research, building on earlier work like Toolformer2) and Large Language Models as Tool Makers3). The framework addresses a critical gap between closed-source models (e.g., ChatGPT) and open-source alternatives by providing a comprehensive data construction, training, and evaluation pipeline.
ToolBench is a large-scale instruction-tuning dataset containing 126,486 (instruction, solution path) pairs covering 16,464 real-world RESTful APIs spanning 49 categories from RapidAPI Hub4). The dataset was automatically constructed using ChatGPT (gpt-3.5-turbo-16k) through a three-phase process: (1) API collection from RapidAPI Hub, (2) instruction generation, in which ChatGPT is prompted with sampled API documentation to produce diverse user requests, and (3) solution path annotation, in which ChatGPT searches for valid sequences of API calls that fulfill each instruction.
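The instruction-generation phase can be illustrated with a minimal sketch: sample a few APIs from the pool and prompt the model to write a request that would require them. The helper names, sampling sizes, and prompt wording below are illustrative assumptions, not the paper's exact implementation; `llm_call` stands in for a ChatGPT API call.

```python
import random

def generate_instructions(api_pool, llm_call, n_samples=3, n_instructions=10):
    """Sketch of ToolBench-style instruction generation: sample a small
    set of APIs, then ask the model for a user request requiring them."""
    instructions = []
    for _ in range(n_instructions):
        # Sample a subset of APIs so generated instructions are diverse
        sampled = random.sample(api_pool, k=min(n_samples, len(api_pool)))
        prompt = (
            "Write a realistic user request that requires these APIs:\n"
            + "\n".join(f"- {api['name']}: {api['description']}" for api in sampled)
        )
        instructions.append({"apis": sampled, "instruction": llm_call(prompt)})
    return instructions
```

In the actual pipeline, the generated (instruction, relevant-API) pairs are then passed to the solution-path annotation phase.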
DFSDT (Depth-First Search-based Decision Tree) is a reasoning algorithm that expands the search space beyond linear chain-of-thought by evaluating multiple reasoning traces in a depth-first search manner.
The search process can be formalized as:
$$T^* = \arg\max_{T \in \mathcal{T}} R(T)$$
where $T$ is a reasoning trace in the search tree $\mathcal{T}$ and $R(T)$ is the reward (successful API execution). At each node, the LLM generates candidate actions, expands promising branches, and backtracks from failed paths.
ToolEval is an automatic evaluation framework using two metrics: pass rate, the fraction of instructions for which the model produces a successful solution path within a limited budget, and win rate, the fraction of pairwise comparisons in which the model's solution path is preferred over a reference (with ChatGPT acting as the judge).
Both metrics show high correlation with human evaluation.
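A simplified reading of the two metrics can be written down directly; the data shapes below (a `solved` flag per instruction, and a preference label per comparison) are assumptions for illustration, not ToolEval's actual interface.

```python
def pass_rate(results):
    """Fraction of instructions whose solution path succeeded within budget
    (simplified pass rate; `results` holds one dict per instruction)."""
    return sum(1 for r in results if r["solved"]) / len(results)

def win_rate(preferences):
    """Fraction of pairwise comparisons won by the candidate model
    (simplified win rate; each label is 'candidate' or 'reference',
    assigned by a ChatGPT judge in the actual framework)."""
    wins = sum(1 for p in preferences if p == "candidate")
    return wins / len(preferences)
```

A win rate of 0.5 means the candidate and reference are preferred equally often, so values above 0.5 indicate the candidate outperforms the reference.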
```python
# Simplified ToolLLM inference loop with DFSDT.
# `execute_api_call`, `update_state`, `is_task_complete`,
# `reconstruct_path`, and `score` are assumed helper functions.
from dataclasses import dataclass, field

@dataclass
class Node:
    state: str
    apis: list = field(default_factory=list)
    parent: "Node" = None
    depth: int = 0

def dfsdt_search(instruction, api_retriever, llm, max_depth=5):
    # Retrieve the top-k candidate APIs for this instruction
    apis = api_retriever.retrieve(instruction, top_k=5)
    root = Node(state=instruction, apis=apis)
    stack = [root]  # LIFO stack yields depth-first expansion
    best_path = None
    while stack:
        node = stack.pop()
        if node.depth >= max_depth:
            continue  # prune branches that exceed the depth budget
        # LLM generates candidate actions (API calls) at this node
        actions = llm.generate_actions(node.state, node.apis)
        for action in actions:
            observation = execute_api_call(action)
            child = Node(
                state=update_state(node.state, action, observation),
                apis=node.apis,  # carry APIs forward so children can act
                parent=node,
                depth=node.depth + 1,
            )
            if is_task_complete(child.state):
                path = reconstruct_path(child)
                if best_path is None or score(path) > score(best_path):
                    best_path = path
            else:
                # Backtracking from failed paths happens implicitly:
                # popping the stack abandons exhausted branches
                stack.append(child)
    return best_path
```