====== Computer Use Benchmark ======

Computer Use Benchmarks evaluate AI agents on their ability to interact with graphical user interfaces (GUIs) to complete real-world tasks.(([[https://o-mega.ai/articles/the-2025-2026-guide-to-ai-computer-use-benchmarks-and-top-ai-agents|Guide to AI Computer Use Benchmarks 2025-2026]])) Unlike text-based benchmarks, these measure whether agents can click buttons, type text, navigate menus, and complete multi-step workflows autonomously, just as a human would use a computer.

===== Overview =====

As AI agents move beyond text-based interactions toward operating computer interfaces directly, standardized benchmarks are needed to measure GUI proficiency. Several complementary benchmarks have emerged, each targeting a different aspect of computer use: end-to-end workflow completion (CUB), desktop task automation (OSWorld), and visual element grounding (ScreenSpot).(([[https://arxiv.org/abs/2505.16518|OSWorld and Computer Use Agent Evaluation]]))

===== CUB (Computer Use Benchmark) =====

CUB was developed by **Theta Software** as the first benchmark focused purely on UI tool use across professional domains.(([[https://thetasoftware.com/blog/introducing-cub/|Theta Software - Introducing CUB]])) It consists of **106 end-to-end workflows across 7 industries**:

  * Business operations
  * Finance
  * E-commerce
  * Construction management
  * Consumer applications
  * Healthcare
  * Supply chain

Each workflow represents a realistic scenario an office worker might encounter, such as updating CRM records from email data, ordering products from supplier websites, or generating reports in spreadsheets.

**Results are sobering**: the highest overall CUB score as of late 2025 is **10.4%** (Writer's Action Agent). [[claude|Claude]] Sonnet with thinking mode scored 3.7%. Most models failed to complete even a single task end-to-end in initial testing.

===== OSWorld =====

OSWorld tests agents on **369 real-world desktop tasks** spanning web applications, desktop software, OS file operations, and multi-app workflows. Tasks are executed in live Windows environments with execution-based evaluation.

Key findings from OSWorld:

  * Best agents now complete approximately **45% of tasks**, up from 6% at launch
  * Performance varies dramatically across UI versions: an agent with 90% success on Windows 11 may drop to 9% on Windows XP for identical tasks
  * Multi-app workflows remain significantly harder than single-app tasks

===== ScreenSpot and ScreenSpot-Pro =====

ScreenSpot benchmarks evaluate **visual GUI grounding**: the ability to locate and identify specific interface elements from screenshots. ScreenSpot-Pro extends this to professional high-resolution displays with complex environments.(([[https://likaixin2000.github.io/papers/ScreenSpot_Pro.pdf|ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use]]))

These benchmarks measure point-in-box accuracy: given a natural language instruction (e.g., "click the submit [[button_device|button]]"), the agent must identify the correct pixel coordinates. This tests the foundational visual perception that all GUI agents require.

Current agents perform approximately **72% below human performance** on visual grounding tasks, indicating that accurate UI perception remains a major bottleneck.
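The point-in-box metric is simple to state: a prediction counts as correct only if the predicted pixel coordinates fall inside the ground-truth bounding box of the target element. The sketch below illustrates the idea; the ''GroundingExample'' structure, the ''agent.locate()'' call, and the field names are hypothetical illustrations, not the actual ScreenSpot evaluation code.

<code python>
# Minimal sketch of point-in-box grounding accuracy.
# Data structures and the agent interface are hypothetical, not from ScreenSpot.
from dataclasses import dataclass

@dataclass
class GroundingExample:
    instruction: str                          # e.g. "click the submit button"
    bbox: tuple[float, float, float, float]   # ground-truth element box (x1, y1, x2, y2)

def point_in_box(point, bbox):
    """Return True if the predicted (x, y) click lands inside the target box."""
    x, y = point
    x1, y1, x2, y2 = bbox
    return x1 <= x <= x2 and y1 <= y <= y2

def grounding_accuracy(agent, examples, screenshots):
    """Fraction of instructions for which the agent's click hits the target element."""
    hits = 0
    for example, screenshot in zip(examples, screenshots):
        predicted = agent.locate(example.instruction, screenshot)  # -> (x, y) in pixels
        if point_in_box(predicted, example.bbox):
            hits += 1
    return hits / len(examples)
</code>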
===== Evaluation Methodology =====

GUI benchmarks share common evaluation approaches:

<code python>
# Typical GUI benchmark evaluation pipeline
class GUIBenchmarkEvaluator:
    def __init__(self, environment, tasks):
        self.env = environment  # Live OS environment
        self.tasks = tasks

    def evaluate_agent(self, agent, consistency_runs=3):
        results = []
        for task in self.tasks:
            success, steps = self._run_once(agent, task)
            results.append({
                "task": task.id,
                "success": success,
                "steps": steps,
                # Reliability: fraction of repeated runs that also succeed
                "consistency": self._multi_run_check(agent, task, consistency_runs),
            })
        return results

    def _run_once(self, agent, task):
        self.env.reset()  # Clean environment state
        # Agent receives the task description and a fresh screenshot each step
        observation = self.env.screenshot()
        steps = 0
        for steps in range(1, task.max_steps + 1):
            action = agent.act(task.instruction, observation)
            observation, done = self.env.execute(action)
            if done:
                break
        # Execution-based: success is judged from the actual system state,
        # not from the agent's text output
        return task.verify(self.env.state), steps

    def _multi_run_check(self, agent, task, runs):
        successes = sum(self._run_once(agent, task)[0] for _ in range(runs))
        return successes / runs
</code>

Key evaluation dimensions include:

  * **Execution-based evaluation** - Success determined by actual system state, not text output
  * **Multi-step completion** - Planning and executing action sequences
  * **Consistency** - Multiple runs per task to measure reliability
  * **Domain diversity** - Tasks across industries and UI complexities

===== Emerging Standards =====

**CUBE (Common Unified Benchmark Environments)** is a proposed protocol built on MCP and Gym that aims to unify different agent benchmarks into a common evaluation framework, addressing fragmentation across CUB, OSWorld, and other benchmarks.(([[https://arxiv.org/html/2603.15798v1|CUBE: Common Unified Benchmark Environments]]))

===== See Also =====

  * [[ai_coding_benchmarks|AI Coding Performance Benchmarks]]
  * [[agent_benchmark_blind_spots|Benchmarks for Agent Blind Spots]]
  * [[toolathlon|Toolathlon]]
  * [[real_work_automation_benchmarking|Real Work Automation Benchmarking]]
  * [[benchmark_exploitation|Benchmark Exploitation]]

===== References =====