====== HLE w/ Tools (Humanloop Evals) ======

**HLE w/ Tools** (Humanloop Evals) is a benchmark designed to measure the performance of large language models on long-horizon tasks that require sequential tool use and multi-step reasoning. The benchmark evaluates models on their ability to decompose complex objectives into actionable steps, invoke external tools appropriately, and maintain coherent execution across extended task sequences.

===== Overview =====

HLE w/ Tools belongs to a category of evaluation frameworks that assess practical capabilities beyond standard language-understanding tasks. Long-horizon execution with tool use simulates real-world scenarios in which language models must interact with external systems, APIs, or computational resources to complete objectives (([[https://arxiv.org/abs/2210.03629|Yao et al. - ReAct: Synergizing Reasoning and Acting in Language Models (2022)]])).

The benchmark measures whether models can plan multi-step solutions, select appropriate tools from the available options, interpret tool outputs, and adapt their strategy based on intermediate results. This contrasts with single-turn question answering and reflects the emerging paradigm of agentic AI systems that operate autonomously across extended workflows.

===== Benchmark Design and Metrics =====

HLE w/ Tools evaluates models on scenarios requiring sustained tool interaction over multiple reasoning steps. Performance is measured by the model's ability to complete tasks without human intervention, factoring in both the correctness of final outputs and the efficiency of tool-usage patterns.

As of April 2026, the open-source state of the art on HLE w/ Tools stands at 54.0, achieved by Kimi K2.6 (([[https://www.latent.space/p/ainews-moonshot-kimi-k26-the-worlds|Latent Space - Moonshot Kimi K2.6 Report (2026)]])).

===== Technical Approach =====

Models evaluated on HLE w/ Tools must demonstrate several core capabilities.
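The kind of agentic loop such a benchmark exercises can be sketched in a few lines. The tool registry, toy tools, and scripted plan below are illustrative assumptions for exposition; HLE w/ Tools does not publish a harness API, and a real evaluation would put a model, not a fixed plan, in the driver's seat.

```python
def search(query: str) -> str:
    """Toy stand-in for a retrieval tool."""
    corpus = {"capital of France": "Paris"}
    return corpus.get(query, "NO_RESULT")

def calculator(expression: str) -> str:
    """Toy stand-in for a computation tool."""
    try:
        return str(eval(expression, {"__builtins__": {}}, {}))
    except Exception as exc:
        return f"ERROR: {exc}"

# Tool selection happens against a registry like this one.
TOOLS = {"search": search, "calculator": calculator}

def run_agent(steps, max_steps=8):
    """Execute a plan: each step names a tool, supplies parameters,
    observes the output, and recovers from failures by moving on."""
    observations = []
    for tool_name, args in steps[:max_steps]:
        tool = TOOLS.get(tool_name)
        if tool is None:  # tool-selection failure
            observations.append(f"unknown tool: {tool_name}")
            continue
        result = tool(*args)
        if result.startswith(("ERROR", "NO_RESULT")):
            # error recovery: record the failure and continue the plan
            observations.append(f"recovering from: {result}")
            continue
        observations.append(result)
    return observations

obs = run_agent([("search", ("capital of France",)),
                 ("calculator", ("2 + 2",))])
print(obs)  # ['Paris', '4']
```

The loop makes the capability list concrete: the plan encodes task decomposition, the registry lookup is tool selection, the `args` tuples are parameter specification, and the failure branch is error recovery.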
These include task decomposition into subtasks amenable to tool execution, tool selection based on task requirements, parameter specification for tool calls, and error recovery when tool outputs indicate failures or unexpected results (([[https://arxiv.org/abs/2005.11401|Lewis et al. - Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (2020)]])).

The benchmark likely incorporates planning and reasoning techniques that have become standard in recent language models. Chain-of-thought prompting, for example, lets a model reason explicitly through problem-solving steps before executing actions (([[https://arxiv.org/abs/2201.11903|Wei et al. - Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (2022)]])).

===== Current Implementations and Performance =====

Kimi K2.6, developed by Moonshot AI, leads the open-source category on HLE w/ Tools with a score of 54.0. This performance reflects advances in instruction following and tool-use reasoning that have emerged in large language models trained with reinforcement learning from human feedback and instruction tuning (([[https://arxiv.org/abs/2109.01652|Wei et al. - Finetuned Language Models Are Zero-Shot Learners (2021)]])).

The benchmark serves as a competitive evaluation metric in the rapidly evolving landscape of AI model capabilities, driving development toward more capable agentic systems.

===== See Also =====

  * [[tool_search_mechanism|Tool Search Mechanism]]
  * [[tool_augmented_language_models|Tool-Augmented Language Models]]
  * [[proximal_labs_frontierswe|Proximal Labs FrontierSWE]]
  * [[toolllm|ToolLLM: Facilitating Large Language Models to Master 16,000+ Real-World APIs]]
  * [[vllm|vLLM]]

===== References =====