Computer Use Benchmark

Computer Use Benchmarks evaluate AI agents on their ability to interact with graphical user interfaces (GUIs) to complete real-world tasks.¹⁾ Unlike text-based benchmarks, these measure whether agents can click buttons, type text, navigate menus, and complete multi-step workflows autonomously, the same way a human would use a computer.

Overview

As AI agents move beyond text-based interactions toward operating computer interfaces directly, standardized benchmarks are needed to measure GUI proficiency. Several complementary benchmarks have emerged, each targeting different aspects of computer use: end-to-end workflow completion (CUB), desktop task automation (OSWorld), and visual element grounding (ScreenSpot).²⁾

CUB (Computer Use Benchmark)

CUB was developed by Theta Software as the first benchmark focused purely on UI tool use across professional domains.³⁾ It consists of 106 end-to-end workflows across 7 industries:

Business operations
Finance
E-commerce
Construction management
Consumer applications
Healthcare
Supply chain

Each workflow represents a realistic scenario an office worker might encounter, such as updating CRM records from email data, ordering products from supplier websites, or generating reports in spreadsheets.

Results are sobering: the highest overall CUB score as of late 2025 is 10.4% (Writer's Action Agent). Claude Sonnet with thinking mode scored 3.7%. Most models failed to complete even a single task end-to-end in initial testing.

OSWorld

OSWorld tests agents on 369 real-world desktop tasks spanning web applications, desktop software, OS file operations, and multi-app workflows. Tasks are executed in live Windows environments with execution-based evaluation.

Key findings from OSWorld:

Best agents now complete approximately 45% of tasks, up from 6% at launch
Performance varies dramatically across UI versions, an agent with 90% success on Windows 11 may drop to 9% on Windows XP for identical tasks
Multi-app workflows remain significantly harder than single-app tasks

ScreenSpot and ScreenSpot-Pro

ScreenSpot benchmarks evaluate visual GUI grounding, the ability to locate and identify specific interface elements from screenshots. ScreenSpot-Pro extends this to professional high-resolution displays with complex environments.⁴⁾.io/papers/ScreenSpot_Pro.pdf|ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use]]))

These benchmarks measure point-in-box accuracy: given a natural language instruction (e.g., “click the submit button”), the agent must identify the correct pixel coordinates. This tests the foundational visual perception that all GUI agents require.

Current agents perform approximately 72% below human performance on visual grounding tasks, indicating that accurate UI perception remains a major bottleneck.

Evaluation Methodology

GUI benchmarks share common evaluation approaches:

# Typical GUI benchmark evaluation pipeline
class GUIBenchmarkEvaluator:
    def __init__(self, environment, tasks):
        self.env = environment  # Live OS environment
        self.tasks = tasks
 
    def evaluate_agent(self, agent):
        results = []
        for task in self.tasks:
            self.env.reset()  # Clean environment state
 
            # Agent receives task description and screenshot
            observation = self.env.screenshot()
            for step in range(task.max_steps):
                action = agent.act(task.instruction, observation)
                observation, done = self.env.execute(action)
                if done:
                    break
 
            # Execution-based: check actual system state
            success = task.verify(self.env.state)
            results.append({
                "task": task.id,
                "success": success,
                "steps": step + 1,
                "consistency": self._multi_run_check(agent, task)
            })
        return results

Key evaluation dimensions include:

Execution-based evaluation - Success determined by actual system state, not text output
Multi-step completion - Planning and executing action sequences
Consistency - Multiple runs per task to measure reliability
Domain diversity - Tasks across industries and UI complexities

Emerging Standards

CUBE (Common Unified Benchmark Environments) is a proposed protocol built on MCP and Gym that aims to unify different agent benchmarks into a common evaluation framework, addressing fragmentation across CUB, OSWorld, and other benchmarks.⁵⁾

References

¹⁾

Guide to AI Computer Use Benchmarks 2025-2026

²⁾

OSWorld and Computer Use Agent Evaluation

³⁾

Theta Software - Introducing CUB

⁴⁾

github

⁵⁾

CUBE: Common Unified Benchmark Environments

AI Agent Knowledge Base

Sidebar

Table of Contents

Computer Use Benchmark

Overview

CUB (Computer Use Benchmark)

OSWorld

ScreenSpot and ScreenSpot-Pro

Evaluation Methodology

Emerging Standards

See Also

References

AI Agent Knowledge Base

User Tools

Site Tools

Sidebar

Table of Contents

Computer Use Benchmark

Overview

CUB (Computer Use Benchmark)

OSWorld

ScreenSpot and ScreenSpot-Pro

Evaluation Methodology

Emerging Standards

See Also

References

Page Tools