API-Bank Benchmark

API-Bank is a comprehensive benchmark introduced by Li et al. (2023) for evaluating the tool-use capabilities of large language models across a diverse set of real-world APIs. It provides both evaluation data and training data, making it one of the first complete benchmarks for tool-augmented LLM research. API-Bank addresses three key questions: how effective are current LLMs at using tools, how can their tool use be improved, and what obstacles remain. 1) 2) 3)

Benchmark Components

Evaluation System

Training Dataset

Evaluation Dimensions

API-Bank tests LLMs across three levels of increasing difficulty:

Level 1 - API Calling: Can the model correctly call a single API given its documentation? Tests parameter extraction and formatting.

Level 2 - API Retrieval + Calling: Can the model identify the correct API from a set and call it properly? Tests tool selection from multiple options.

Level 3 - Planning + Retrieval + Calling: Can the model decompose a complex request into multiple API calls, retrieve the right APIs, and execute them in order? Tests multi-step tool-use reasoning.
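The Level 1 check above — did the model call a single API correctly given its documentation? — can be sketched as a simple exact-match evaluator. This is an illustrative sketch only: the API name, JSON call format, and function names below are assumptions for the example, not API-Bank's actual schema or scoring code.

```python
import json

# Hypothetical API documentation shown to the model (illustrative only).
API_DOC = {
    "name": "get_weather",
    "parameters": {"city": "string", "unit": "string"},
}

def check_api_call(model_output: str, expected: dict) -> bool:
    """Level-1-style check: parse the model's JSON-formatted call and
    compare the API name and extracted parameters to a reference call."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return False  # a malformed call counts as a failure
    return (
        call.get("api") == expected["api"]
        and call.get("arguments") == expected["arguments"]
    )

expected = {"api": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}
model_output = '{"api": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}'
print(check_api_call(model_output, expected))  # True
```

Levels 2 and 3 layer selection and planning on top of this same check: Level 2 scores whether the model picked the right API out of a candidate pool before calling it, and Level 3 scores an ordered sequence of such calls.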

Key Results

Importance for Tool-Use Evaluation

API-Bank was among the first benchmarks to provide both a multi-level evaluation system and a training dataset for tool-augmented LLMs.

It established the methodology for subsequent tool-use benchmarks like ToolBench, MINT, and T-Eval, and informed the development of function calling APIs and self-supervised tool learning approaches.

See Also

References

1)
Li, M. et al. "API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs." arXiv:2304.08244, 2023. https://arxiv.org/abs/2304.08244
2)
Schick, T. et al. “Toolformer: Language Models Can Teach Themselves to Use Tools.” arXiv:2302.04761, 2023.
3)
Qin, Y. et al. “ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs.” arXiv:2307.16789, 2023.