====== Computer Use Benchmark ======

Computer Use Benchmarks evaluate AI agents on their ability to interact with graphical user interfaces (GUIs) to complete real-world tasks. Unlike text-based benchmarks, these measure whether agents can click buttons, type text, navigate menus, and complete multi-step workflows autonomously -- the same way a human would use a computer.

===== Overview =====

As AI agents move beyond text-based interactions toward operating computer interfaces directly, standardized benchmarks are needed to measure GUI proficiency. Several complementary benchmarks have emerged, each targeting a different aspect of computer use: end-to-end workflow completion (CUB), desktop task automation (OSWorld), and visual element grounding (ScreenSpot).

===== CUB (Computer Use Benchmark) =====

CUB was developed by **Theta Software** as the first benchmark focused purely on UI tool use across professional domains. It consists of **106 end-to-end workflows across 7 industries**:

* Business operations
* Finance
* E-commerce
* Construction management
* Consumer applications
* Healthcare
* Supply chain

Each workflow represents a realistic scenario an office worker might encounter, such as updating CRM records from email data, ordering products from supplier websites, or generating reports in spreadsheets.

**Results are sobering**: the highest overall CUB score as of late 2025 is **10.4%** (Writer's Action Agent). Claude Sonnet with thinking mode scored 3.7%. Most models failed to complete even a single task end-to-end in initial testing.

===== OSWorld =====

OSWorld tests agents on **369 real-world desktop tasks** spanning web applications, desktop software, OS file operations, and multi-app workflows. Tasks are executed in live Windows environments with execution-based evaluation.
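To make the all-or-nothing scoring concrete, here is a minimal sketch of how a CUB-style workflow and its overall score could be represented. The names (`Workflow`, `overall_score`) and fields are illustrative assumptions, not the actual CUB harness; the only grounded detail is that a workflow either completes end-to-end or counts as a failure.

```python
from dataclasses import dataclass

# Hypothetical representation of a CUB-style end-to-end workflow.
# Field names are assumptions for illustration, not CUB's real schema.
@dataclass
class Workflow:
    workflow_id: str
    industry: str       # one of the 7 CUB industries
    instruction: str    # natural-language task description

def overall_score(outcomes: dict[str, bool]) -> float:
    """Fraction of workflows completed end-to-end (all-or-nothing)."""
    if not outcomes:
        return 0.0
    return sum(outcomes.values()) / len(outcomes)

# Example: 11 successes out of 106 workflows is roughly the 10.4%
# top score reported for CUB.
outcomes = {f"wf-{i}": i < 11 for i in range(106)}
print(round(overall_score(outcomes) * 100, 1))  # → 10.4
```

The point of the all-or-nothing metric is that partial progress through a multi-step workflow earns nothing, which is why most agents scored near zero in initial testing.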
Key findings from OSWorld:

* Best agents now complete approximately **45% of tasks**, up from 6% at launch
* Performance varies dramatically across UI versions -- an agent with 90% success on Windows 11 may drop to 9% on Windows XP for identical tasks
* Multi-app workflows remain significantly harder than single-app tasks

===== ScreenSpot and ScreenSpot-Pro =====

ScreenSpot benchmarks evaluate **visual GUI grounding** -- the ability to locate and identify specific interface elements from screenshots. ScreenSpot-Pro extends this to professional high-resolution displays with complex software environments.

These benchmarks measure point-in-box accuracy: given a natural language instruction (e.g., "click the submit button"), the agent must predict pixel coordinates, which count as correct only if they fall inside the target element's bounding box. This tests the foundational visual perception that all GUI agents require.

Current agents perform approximately **72% below human performance** on visual grounding tasks, indicating that accurate UI perception remains a major bottleneck.
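The point-in-box metric described above is simple to state precisely. The sketch below assumes axis-aligned pixel bounding boxes given as ''(left, top, right, bottom)''; the function names are illustrative, not from the ScreenSpot codebase.

```python
# Minimal sketch of ScreenSpot-style point-in-box scoring: a predicted
# click point counts as a hit only if it lands inside the target
# element's bounding box. Names are illustrative assumptions.

def point_in_box(point, box):
    """box = (left, top, right, bottom) in pixels; point = (x, y)."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom

def grounding_accuracy(predictions, boxes):
    """Fraction of predicted points landing inside their target boxes."""
    hits = sum(point_in_box(p, b) for p, b in zip(predictions, boxes))
    return hits / len(boxes)

# "Click the submit button" -> agent predicts (412, 305); the button's
# box is (380, 290, 460, 320), so this prediction is a hit. The second
# prediction misses its target box entirely.
preds = [(412, 305), (10, 10)]
targets = [(380, 290, 460, 320), (100, 100, 200, 140)]
print(grounding_accuracy(preds, targets))  # → 0.5
```

Because the agent outputs a single point rather than a box, the metric rewards precise localization: a prediction one pixel outside the element scores zero.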
===== Evaluation Methodology =====

GUI benchmarks share common evaluation approaches:

<code python>
# Typical GUI benchmark evaluation pipeline
class GUIBenchmarkEvaluator:
    def __init__(self, environment, tasks):
        self.env = environment  # Live OS environment
        self.tasks = tasks

    def evaluate_agent(self, agent):
        results = []
        for task in self.tasks:
            self.env.reset()  # Clean environment state
            # Agent receives task description and screenshot
            observation = self.env.screenshot()
            for step in range(task.max_steps):
                action = agent.act(task.instruction, observation)
                observation, done = self.env.execute(action)
                if done:
                    break
            # Execution-based: check actual system state
            success = task.verify(self.env.state)
            results.append({
                "task": task.id,
                "success": success,
                "steps": step + 1,
                "consistency": self._multi_run_check(agent, task),
            })
        return results
</code>

Key evaluation dimensions include:

* **Execution-based evaluation** - Success determined by actual system state, not text output
* **Multi-step completion** - Planning and executing action sequences
* **Consistency** - Multiple runs per task to measure reliability
* **Domain diversity** - Tasks across industries and UI complexities

===== Emerging Standards =====

**CUBE (Common Unified Benchmark Environments)** is a proposed protocol built on MCP and Gym that aims to unify different agent benchmarks into a common evaluation framework, addressing fragmentation across CUB, OSWorld, and other benchmarks.
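Since CUBE is only a proposal, its concrete API is not settled; the sketch below shows one plausible shape for a Gym-style unification layer in which heterogeneous benchmarks sit behind a common ''reset''/''step'' interface. Every class and method name here is an assumption for illustration, including the toy backend used to exercise the loop.

```python
# Hedged sketch of a CUBE-style unification layer: wrapping different
# GUI benchmarks (CUB, OSWorld, ...) behind one Gym-like interface.
# All names below are illustrative assumptions, not the CUBE spec.
from abc import ABC, abstractmethod

class UnifiedBenchmarkEnv(ABC):
    """Common Gym-style surface over heterogeneous GUI benchmarks."""

    @abstractmethod
    def reset(self, task_id: str) -> dict:
        """Return the initial observation (e.g. a screenshot) for a task."""

    @abstractmethod
    def step(self, action: dict) -> tuple[dict, bool]:
        """Execute one UI action; return (observation, done)."""

class ToyCountdownEnv(UnifiedBenchmarkEnv):
    """Trivial stand-in backend that finishes after three actions."""

    def reset(self, task_id):
        self.remaining = 3
        return {"task": task_id, "screen": "initial"}

    def step(self, action):
        self.remaining -= 1
        return {"screen": f"after {action['type']}"}, self.remaining == 0

# The evaluation harness only sees the abstract interface, so any
# benchmark wrapped this way can reuse the same agent loop.
env = ToyCountdownEnv()
obs = env.reset("cub/finance-01")
done, steps = False, 0
while not done:
    obs, done = env.step({"type": "click"})
    steps += 1
print(steps)  # → 3
```

The design mirrors Gym's reset/step contract, which is presumably why the CUBE proposal builds on it: a single agent loop can then drive any wrapped benchmark.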
===== References =====

* [[https://thetasoftware.com/blog/introducing-cub/|Theta Software - Introducing CUB]]
* [[https://o-mega.ai/articles/the-2025-2026-guide-to-ai-computer-use-benchmarks-and-top-ai-agents|Guide to AI Computer Use Benchmarks 2025-2026]]
* [[https://likaixin2000.github.io/papers/ScreenSpot_Pro.pdf|ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use]]
* [[https://arxiv.org/abs/2505.16518|OSWorld and Computer Use Agent Evaluation]]
* [[https://arxiv.org/html/2603.15798v1|CUBE: Common Unified Benchmark Environments]]

===== See Also =====

* [[terminal_bench]] - CLI/DevOps agent benchmark (non-GUI counterpart)
* [[gaia_benchmark]] - General AI assistant benchmark with tool-use evaluation
* [[agent_simulation_environments]] - 3D environments for embodied agent evaluation
* [[agent_observability]] - Monitoring agent behavior in production deployments