====== Computer Use Benchmark ======

Computer Use Benchmarks evaluate AI agents on their ability to interact with graphical user interfaces (GUIs) to complete real-world tasks.(([[https://o-mega.ai/articles/the-2025-2026-guide-to-ai-computer-use-benchmarks-and-top-ai-agents|Guide to AI Computer Use Benchmarks 2025-2026]])) Unlike text-based benchmarks, these measure whether agents can click buttons, type text, navigate menus, and complete multi-step workflows autonomously, just as a human would use a computer.

===== Overview =====

As AI agents move beyond text-based interactions toward operating computer interfaces directly, standardized benchmarks are needed to measure GUI proficiency. Several complementary benchmarks have emerged, each targeting a different aspect of computer use: end-to-end workflow completion (CUB), desktop task automation (OSWorld), and visual element grounding (ScreenSpot).(([[https://arxiv.org/abs/2505.16518|OSWorld and Computer Use Agent Evaluation]]))

===== CUB (Computer Use Benchmark) =====

CUB was developed by **Theta Software** as the first benchmark focused purely on UI tool use across professional domains.(([[https://thetasoftware.com/blog/introducing-cub/|Theta Software - Introducing CUB]])) It consists of **106 end-to-end workflows across 7 industries**:

  * Business operations
  * Finance
  * E-commerce
  * Construction management
  * Consumer applications
  * Healthcare
  * Supply chain

Each workflow represents a realistic scenario an office worker might encounter, such as updating CRM records from email data, ordering products from supplier websites, or generating reports in spreadsheets.

**Results are sobering**: the highest overall CUB score as of late 2025 is **10.4%** (Writer's Action Agent). [[claude|Claude]] Sonnet with thinking mode scored 3.7%. Most models failed to complete even a single task end-to-end in initial testing.

===== OSWorld =====

OSWorld tests agents on **369 real-world desktop tasks** spanning web applications, desktop software, OS file operations, and multi-app workflows. Tasks are executed in live Windows environments with execution-based evaluation.

Key findings from OSWorld:

  * Best agents now complete approximately **45% of tasks**, up from 6% at launch
  * Performance varies dramatically across UI versions: an agent with 90% success on Windows 11 may drop to 9% on Windows XP for identical tasks
  * Multi-app workflows remain significantly harder than single-app tasks

===== ScreenSpot and ScreenSpot-Pro =====

ScreenSpot benchmarks evaluate **visual GUI grounding**: the ability to locate and identify specific interface elements from screenshots. ScreenSpot-Pro extends this to professional high-resolution displays with complex environments.(([[https://likaixin2000.github.io/papers/ScreenSpot_Pro.pdf|ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use]]))

These benchmarks measure point-in-box accuracy: given a natural language instruction (e.g., "click the submit [[button_device|button]]"), the agent must identify the correct pixel coordinates. This tests the foundational visual perception that all GUI agents require.

Current agents perform approximately **72% below human performance** on visual grounding tasks, indicating that accurate UI perception remains a major bottleneck.
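The point-in-box metric is simple to state: a prediction counts as correct only if the predicted pixel coordinates fall inside the ground-truth bounding box of the target element. The sketch below illustrates the idea; the ''GroundingExample'' structure, the ''agent.locate()'' call, and the field names are hypothetical illustrations, not the actual ScreenSpot evaluation code.

<code python>
# Minimal sketch of point-in-box grounding accuracy.
# Data structures and the agent interface are hypothetical, not from ScreenSpot.
from dataclasses import dataclass

@dataclass
class GroundingExample:
    instruction: str                          # e.g. "click the submit button"
    bbox: tuple[float, float, float, float]   # ground-truth element box (x1, y1, x2, y2)

def point_in_box(point, bbox):
    """Return True if the predicted (x, y) click lands inside the target box."""
    x, y = point
    x1, y1, x2, y2 = bbox
    return x1 <= x <= x2 and y1 <= y <= y2

def grounding_accuracy(agent, examples, screenshots):
    """Fraction of instructions for which the agent's click hits the target element."""
    hits = 0
    for example, screenshot in zip(examples, screenshots):
        predicted = agent.locate(example.instruction, screenshot)  # -> (x, y) in pixels
        if point_in_box(predicted, example.bbox):
            hits += 1
    return hits / len(examples)
</code>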
===== Evaluation Methodology =====

GUI benchmarks share common evaluation approaches:

<code python>
# Typical GUI benchmark evaluation pipeline
class GUIBenchmarkEvaluator:
    def __init__(self, environment, tasks):
        self.env = environment  # Live OS environment
        self.tasks = tasks

    def evaluate_agent(self, agent, consistency_runs=3):
        results = []
        for task in self.tasks:
            success, steps = self._run_once(agent, task)
            results.append({
                "task": task.id,
                "success": success,
                "steps": steps,
                # Reliability: fraction of repeated runs that also succeed
                "consistency": self._multi_run_check(agent, task, consistency_runs),
            })
        return results

    def _run_once(self, agent, task):
        self.env.reset()  # Clean environment state
        # Agent receives the task description and a fresh screenshot each step
        observation = self.env.screenshot()
        steps = 0
        for steps in range(1, task.max_steps + 1):
            action = agent.act(task.instruction, observation)
            observation, done = self.env.execute(action)
            if done:
                break
        # Execution-based: success is judged from the actual system state,
        # not from the agent's text output
        return task.verify(self.env.state), steps

    def _multi_run_check(self, agent, task, runs):
        successes = sum(self._run_once(agent, task)[0] for _ in range(runs))
        return successes / runs
</code>

Key evaluation dimensions include:

  * **Execution-based evaluation** - Success determined by actual system state, not text output
  * **Multi-step completion** - Planning and executing action sequences
  * **Consistency** - Multiple runs per task to measure reliability
  * **Domain diversity** - Tasks across industries and UI complexities

===== Emerging Standards =====

**CUBE (Common Unified Benchmark Environments)** is a proposed protocol built on MCP and Gym that aims to unify different agent benchmarks into a common evaluation framework, addressing fragmentation across CUB, OSWorld, and other benchmarks.(([[https://arxiv.org/html/2603.15798v1|CUBE: Common Unified Benchmark Environments]]))

===== See Also =====

  * [[ai_coding_benchmarks|AI Coding Performance Benchmarks]]
  * [[agent_benchmark_blind_spots|Benchmarks for Agent Blind Spots]]
  * [[toolathlon|Toolathlon]]
  * [[real_work_automation_benchmarking|Real Work Automation Benchmarking]]
  * [[benchmark_exploitation|Benchmark Exploitation]]

===== References =====