Computer Use agents (also called GUI agents or CUAs) are AI systems that autonomously interact with digital devices by perceiving screens and executing actions like clicking, typing, and navigation. Rather than operating through APIs or structured data, these agents interact with software the same way humans do — through the graphical user interface.
This capability represents a fundamental shift in AI, moving beyond text-based assistance to direct computer control across desktops, mobile devices, web browsers, and any software application.
GUI agents operate through multimodal vision-language models (VLMs) that interpret screenshots and select primitive actions. The typical workflow involves:
Perception: Agents capture screenshots or read structured data (the DOM or accessibility tree) to understand the on-screen environment; pure-vision, structured-data, and hybrid approaches are all in use.
Reasoning: The model analyzes visual elements to determine their location, identity, and properties, then decides what action to take.
Action Execution: Agents control the mouse and keyboard to interact with identified elements — clicking buttons, filling forms, typing text, scrolling, and navigating.
Iteration: Multi-step tasks are completed through sequential action cycles, with the agent observing the result of each action before deciding the next step.
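The loop above can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: `capture_screenshot`, `model_decide`, and `execute` are hypothetical stand-ins for real screen-capture, VLM inference, and input-control components.

```python
# Sketch of the perceive -> reason -> act -> iterate cycle. The three
# callables are hypothetical stand-ins for real backends.
def run_agent(task, capture_screenshot, model_decide, execute, max_steps=10):
    """Run the agent loop until the model signals completion or steps run out."""
    history = []
    for _ in range(max_steps):
        screenshot = capture_screenshot()                  # Perception
        action = model_decide(task, screenshot, history)   # Reasoning
        if action["type"] == "done":
            break
        execute(action)                                    # Action execution
        history.append(action)                             # Iteration: record result
    return history
```

Injecting the three components makes the loop trivial to test with stubs before wiring in a real model and input controller.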
Anthropic Claude Computer Use: Launched in October 2024, Claude gained the ability to view screenshots, move the mouse cursor, click buttons, and type text. Computer use operates through the Anthropic API via a computer_20241022 tool type that accepts screen-resolution parameters; the model responds with actions for the client to execute.
OpenAI Operator: OpenAI's browser-based agent that can navigate websites and complete tasks autonomously. Operator uses a custom model trained for web interaction and includes safety mechanisms to hand control back to users for sensitive actions.
Writer Action Agent (Palmyra X5): Leads both the GAIA and CUB benchmarks as of 2025; described as a 'super agent' that handles complex multi-step work autonomously.
CogAgent: Employs a high-resolution cross-module to process small icons and text, enhancing efficiency for GUI tasks including DOM element generation and action prediction.
| Benchmark | Description | Top Score (2025) |
|---|---|---|
| CUB (Computer Use Benchmark) | 106 workflows across 7 industries | 10.4% (Writer Action Agent) |
| OSWorld | Realistic OS environment for multimodal agents | Active evaluation |
| WebArena | 804 web tasks across 4 categories | 61.7% (IBM CUGA) |
| GAIA Level 3 | General AI assistant reasoning | Writer Action Agent leads |
| Mind2Web | 2,350 tasks on 137 live websites | ~40-50% top scores |
The relatively low CUB scores (10.4% best) highlight that autonomous computer use remains a challenging frontier.
```python
from anthropic import Anthropic
import base64

client = Anthropic()

# Define the computer use tool
computer_tool = {
    "type": "computer_20241022",
    "name": "computer",
    "display_width_px": 1920,
    "display_height_px": 1080,
    "display_number": 1,
}

# Send a task with a screenshot
with open("screenshot.png", "rb") as f:
    screenshot_b64 = base64.standard_b64encode(f.read()).decode()

response = client.beta.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    tools=[computer_tool],
    betas=["computer-use-2024-10-22"],  # computer use requires the beta flag
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Click the Submit button"},
            {"type": "image", "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": screenshot_b64,
            }},
        ],
    }],
)

# The response's tool_use block carries the requested action, e.g.
# {"action": "mouse_move", "coordinate": [450, 320]} followed by a left_click
```
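The client is responsible for actually performing whatever action the model requests. A minimal dispatcher might look like the sketch below; the `backend` object and its method names are assumptions (e.g. a thin wrapper around an input-control library such as pyautogui), and the action names follow the computer_20241022 tool's schema.

```python
# Hypothetical dispatcher from tool_use action payloads to local input control.
# `backend` is an assumed interface, not part of the Anthropic SDK.
def dispatch(action, backend):
    """Execute one action dict returned by the computer use tool."""
    name = action["action"]
    if name == "screenshot":
        return backend.screenshot()          # return a fresh screen capture
    if name == "mouse_move":
        x, y = action["coordinate"]
        return backend.move(x, y)            # move the cursor to (x, y)
    if name == "left_click":
        return backend.click()               # click at the current position
    if name == "type":
        return backend.type(action["text"])  # type a string of text
    if name == "key":
        return backend.key(action["text"])   # press a key or key chord
    raise ValueError(f"unsupported action: {name!r}")
```

Keeping the backend behind an interface lets the same dispatch logic drive a real desktop, a sandboxed VM, or a test double.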
Computer use agents raise significant safety and security concerns: