====== Computer Use ======

**Computer Use agents** (also called GUI agents or computer-using agents, CUAs) are AI systems that autonomously interact with digital devices by perceiving screens and executing actions such as clicking, typing, and navigating. Rather than operating through APIs or structured data, these agents interact with software the same way humans do: through the graphical user interface. This capability marks a fundamental shift in AI, moving beyond text-based assistance to direct computer control across desktops, mobile devices, web browsers, and any software application.

===== How Computer Use Works =====

GUI agents are built on multimodal vision-language models (VLMs) that process visual information from screenshots and emit primitive actions. The typical workflow involves:

**Perception**: Agents capture screenshots or access DOM information to understand the on-screen environment. There are three main approaches:

  * **Screenshot-based**: Pure visual analysis of screen images (the most general approach; works with any application)
  * **HTML/DOM-based**: Processing structured textual representations of web pages (more efficient, but limited to browsers)
  * **Hybrid**: Combining visual and textual inputs for robust cross-environment performance

**Reasoning**: The model analyzes visual elements to determine their location, identity, and properties, then decides what action to take.

**Action Execution**: Agents control the mouse and keyboard to interact with identified elements: clicking buttons, filling forms, typing text, scrolling, and navigating.

**Iteration**: Multi-step tasks are completed through sequential action cycles, with the agent observing the result of each action before deciding the next step.

===== Key Implementations =====

**Anthropic Claude Computer Use**: Launched in October 2024, Claude gained the ability to view screenshots, move the mouse cursor, click buttons, and type text.
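The perception, reasoning, action, and iteration steps described under How Computer Use Works can be sketched as a minimal control loop. This is an illustrative sketch, not any vendor's API: ''capture'', ''decide'', and ''execute'' are hypothetical callbacks standing in for screenshot capture, the VLM, and input control.

<code python>
from dataclasses import dataclass
from typing import Callable

@dataclass
class Action:
    kind: str      # e.g. "click", "type", "done" (hypothetical vocabulary)
    payload: dict  # action parameters, e.g. {"x": 450, "y": 320}

def run_agent(task: str,
              capture: Callable[[], bytes],
              decide: Callable[[str, bytes], Action],
              execute: Callable[[Action], None],
              max_steps: int = 10) -> list[Action]:
    """Perceive -> reason -> act, repeated until the model signals completion."""
    history: list[Action] = []
    for _ in range(max_steps):
        screenshot = capture()             # Perception
        action = decide(task, screenshot)  # Reasoning
        history.append(action)
        if action.kind == "done":          # Task finished
            break
        execute(action)                    # Action execution, then iterate
    return history
</code>

Real agents replace ''decide'' with a VLM call and ''execute'' with an input-control layer; the ''max_steps'' cap is a common safeguard against runaway loops.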
Claude Computer Use operates through the Anthropic API as a beta feature, with a ''computer_20241022'' tool type that accepts screen-resolution parameters and returns actions with pixel coordinates.

**OpenAI Operator**: OpenAI's browser-based agent that can navigate websites and complete tasks autonomously. Operator uses a custom model trained for web interaction and includes safety mechanisms that hand control back to users for sensitive actions.

**Writer Action Agent (Palmyra X5)**: Leads both the GAIA and CUB benchmarks as of 2025; described as a "super agent" that handles complex multi-step work autonomously.

**CogAgent**: Employs a high-resolution cross-module to process small icons and text, improving efficiency on GUI tasks including DOM element generation and action prediction.

===== Benchmarks =====

^ Benchmark ^ Description ^ Top Score (2025) ^
| CUB (Computer Use Benchmark) | 106 workflows across 7 industries | 10.4% (Writer Action Agent) |
| OSWorld | Realistic OS environment for multimodal agents | Active evaluation |
| WebArena | 804 web tasks across 4 categories | 61.7% (IBM CUGA) |
| GAIA Level 3 | General AI assistant reasoning | Writer Action Agent leads |
| Mind2Web | 2,350 tasks on 137 live websites | ~40-50% top scores |

The relatively low CUB scores (10.4% at best) highlight that autonomous computer use remains a challenging frontier.
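On the execution side, the actions a model like Claude returns as ''tool_use'' blocks must be translated into local input events. A sketch of such a dispatcher, assuming the block shape used by the ''computer_20241022'' tool (an ''action'' name plus parameters such as a ''coordinate'' pair); verify the exact schema against Anthropic's computer use documentation:

<code python>
def parse_computer_action(block: dict) -> tuple[str, dict]:
    """Map a tool_use content block to (action, kwargs) that a local
    input controller (e.g. pyautogui) could execute."""
    if block.get("type") != "tool_use" or block.get("name") != "computer":
        raise ValueError("not a computer tool call")
    inp = block["input"]
    action = inp["action"]
    if action in ("left_click", "right_click", "double_click", "mouse_move"):
        x, y = inp["coordinate"]   # pixel position on the screenshot
        return action, {"x": x, "y": y}
    if action == "type":
        return action, {"text": inp["text"]}
    if action in ("screenshot", "cursor_position"):
        return action, {}          # no parameters needed
    raise ValueError(f"unhandled action: {action}")
</code>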
===== Code Example =====

<code python>
from anthropic import Anthropic
import base64

client = Anthropic()

# Define the computer use tool; the tool version must match the model
# (computer_20241022 pairs with Claude 3.5 Sonnet)
computer_tool = {
    "type": "computer_20241022",
    "name": "computer",
    "display_width_px": 1920,
    "display_height_px": 1080,
    "display_number": 1,
}

# Send a task together with a screenshot of the current screen
with open("screenshot.png", "rb") as f:
    screenshot_b64 = base64.standard_b64encode(f.read()).decode()

response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=[computer_tool],
    betas=["computer-use-2024-10-22"],  # computer use is gated behind this beta flag
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Click the Submit button"},
            {"type": "image", "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": screenshot_b64,
            }},
        ],
    }],
)

# The response contains tool_use blocks whose input describes the action,
# e.g. {"action": "left_click", "coordinate": [450, 320]}
</code>

===== Safety Considerations =====

Computer use agents raise significant safety and security concerns:

  * **Privacy**: Screenshot-based perception may capture sensitive information (passwords, personal data, financial details)
  * **Unintended Actions**: Agents may click the wrong elements or perform destructive actions
  * **Sandboxing**: Most implementations run in isolated virtual environments to prevent real-world damage
  * **Human Oversight**: Critical actions typically require human approval before execution
  * **Prompt Injection**: Malicious content on screen could manipulate agent behavior

===== References =====

  * [[https://arxiv.org/abs/2501.16150|Comprehensive Survey on Computer Use Agents (2025)]]
  * [[https://aclanthology.org/2025.findings-acl.1158.pdf|ACL 2025 — GUI Agent Survey]]
  * [[https://docs.anthropic.com/en/docs/build-with-claude/computer-use|Anthropic — Computer Use Documentation]]
  * [[https://o-mega.ai/articles/the-2025-2026-guide-to-ai-computer-use-benchmarks-and-top-ai-agents|Guide to Computer Use Benchmarks 2025]]

===== See Also =====

  * [[claude_agent_sdk]] — Claude Agent SDK with computer use support
  * [[agent_evaluation]] — AI agent benchmarks and evaluation
  * [[devin]] — Devin autonomous software engineer
  * [[multi_agent_systems]] — Multi-agent system architectures