API-Bank Benchmark

API-Bank is a comprehensive benchmark introduced by Li et al. (2023) for evaluating the tool-use capabilities of large language models across a diverse set of real-world APIs. It provides both evaluation data and training data, making it one of the first complete benchmarks for tool-augmented LLM research. API-Bank addresses three key questions: how effective are current LLMs at using tools, how can their tool use be improved, and what obstacles remain. 1) 2) 3)

Benchmark Components

Evaluation System

Training Dataset

Evaluation Dimensions

API-Bank tests LLMs across three levels of increasing difficulty:

Level 1 - API Calling: Can the model correctly call a single API given its documentation? Tests parameter extraction and formatting.

Level 2 - API Retrieval + Calling: Can the model identify the correct API from a set and call it properly? Tests tool selection from multiple options.

Level 3 - Planning + Retrieval + Calling: Can the model decompose a complex request into multiple API calls, retrieve the right APIs, and execute them in order? Tests multi-step tool-use reasoning.
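The Level 1 check above — did the model call a single API correctly given its documentation? — can be sketched as a simple exact-match evaluator. This is an illustrative sketch only: the API name, JSON call format, and function names below are assumptions for the example, not API-Bank's actual schema or scoring code.

```python
import json

# Hypothetical API documentation shown to the model (illustrative only).
API_DOC = {
    "name": "get_weather",
    "parameters": {"city": "string", "unit": "string"},
}

def check_api_call(model_output: str, expected: dict) -> bool:
    """Level-1-style check: parse the model's JSON-formatted call and
    compare the API name and extracted parameters to a reference call."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return False  # a malformed call counts as a failure
    return (
        call.get("api") == expected["api"]
        and call.get("arguments") == expected["arguments"]
    )

expected = {"api": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}
model_output = '{"api": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}'
print(check_api_call(model_output, expected))  # True
```

Levels 2 and 3 layer selection and planning on top of this same check: Level 2 scores whether the model picked the right API out of a candidate pool before calling it, and Level 3 scores an ordered sequence of such calls.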

Key Results

Importance for Tool-Use Evaluation

API-Bank was among the first benchmarks to provide both a multi-level evaluation system and a training dataset for tool-augmented LLMs.

It established the methodology for subsequent tool-use benchmarks like ToolBench, MINT, and T-Eval, and informed the development of function calling APIs and self-supervised tool learning approaches.

See Also

References

1)
Li, M. et al. "API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs." arXiv:2304.08244, 2023. https://arxiv.org/abs/2304.08244
2)
Schick, T. et al. “Toolformer: Language Models Can Teach Themselves to Use Tools.” arXiv:2302.04761, 2023.
3)
Qin, Y. et al. “ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs.” arXiv:2307.16789, 2023.