Computer Use

Computer Use agents (also called GUI agents or CUAs) are AI systems that autonomously interact with digital devices by perceiving screens and executing actions like clicking, typing, and navigation. Rather than operating through APIs or structured data, these agents interact with software the same way humans do — through the graphical user interface.

This capability represents a fundamental shift in AI, moving beyond text-based assistance to direct computer control across desktops, mobile devices, web browsers, and any software application.

How Computer Use Works

GUI agents operate through multimodal vision-language models (VLMs) that process visual information from screenshots and execute primitive actions. The typical workflow involves:

Perception: Agents capture screenshots or read structured representations such as the DOM or accessibility tree to understand the on-screen environment. There are three main approaches: pure vision (screenshots only), structure-based (DOM or accessibility data), and hybrids that combine both.

Reasoning: The model analyzes visual elements to determine their location, identity, and properties, then decides what action to take.

Action Execution: Agents control the mouse and keyboard to interact with identified elements — clicking buttons, filling forms, typing text, scrolling, and navigating.

Iteration: Multi-step tasks are completed through sequential action cycles, with the agent observing the result of each action before deciding the next step.
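The four steps above form a perceive-reason-act loop. A minimal sketch, where `decide`, `observe`, and `execute` are hypothetical callables supplied by the caller (a VLM call, a screenshot capture, and an input controller, respectively), not part of any real library:

```python
def run_agent(task, decide, observe, execute, max_steps=20):
    """Perceive-reason-act loop. `decide`, `observe`, and `execute`
    are caller-supplied hooks; this is an illustrative skeleton."""
    history = []
    for _ in range(max_steps):
        screenshot = observe()                      # Perception: capture current screen
        action = decide(task, screenshot, history)  # Reasoning: choose the next action
        if action["type"] == "done":                # Model signals task completion
            break
        execute(action)                             # Action execution: click, type, scroll
        history.append(action)                      # Iteration: result observed next cycle
    return history

# Toy demonstration with scripted stubs in place of a real model and screen
script = iter([{"type": "click", "x": 10, "y": 20}, {"type": "done"}])
log = run_agent(
    task="click the button",
    decide=lambda task, shot, hist: next(script),
    observe=lambda: b"fake-screenshot-bytes",
    execute=lambda action: None,
)
print(log)  # [{'type': 'click', 'x': 10, 'y': 20}]
```

The cap on `max_steps` is a common safeguard: without it, an agent that never emits a terminal action would loop indefinitely.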

Key Implementations

Anthropic Claude Computer Use: Launched in October 2024, Claude gained the ability to view screenshots, move the mouse cursor, click buttons, and type text. It operates through the Anthropic API (as a beta feature) with a computer_20241022 tool type that accepts screen-resolution parameters and returns tool calls containing actions and screen coordinates.

OpenAI Operator: OpenAI's browser-based agent that can navigate websites and complete tasks autonomously. Operator uses a custom model trained for web interaction and includes safety mechanisms to hand control back to users for sensitive actions.

Writer Action Agent (Palmyra X5): Leads both the GAIA and CUB benchmarks as of 2025, and is described by Writer as a "super agent" that handles complex multi-step work autonomously.

CogAgent: Employs a high-resolution cross-module to process small icons and text, enhancing efficiency for GUI tasks including DOM element generation and action prediction.

Benchmarks

| Benchmark | Description | Top Score (2025) |
|---|---|---|
| CUB (Computer Use Benchmark) | 106 workflows across 7 industries | 10.4% (Writer Action Agent) |
| OSWorld | Realistic OS environment for multimodal agents | Active evaluation |
| WebArena | 804 web tasks across 4 categories | 61.7% (IBM CUGA) |
| GAIA Level 3 | General AI assistant reasoning | Writer Action Agent leads |
| Mind2Web | 2,350 tasks on 137 live websites | ~40–50% top scores |

The relatively low CUB scores (10.4% best) highlight that autonomous computer use remains a challenging frontier.

Code Example

from anthropic import Anthropic
import base64

client = Anthropic()

# Define the computer use tool; the declared resolution should match
# the screenshots you send
computer_tool = {
    "type": "computer_20241022",
    "name": "computer",
    "display_width_px": 1920,
    "display_height_px": 1080,
    "display_number": 1,
}

# Send a task with a screenshot
with open("screenshot.png", "rb") as f:
    screenshot_b64 = base64.standard_b64encode(f.read()).decode()

# Computer use is a beta feature, so it goes through the beta endpoint
# with the matching beta flag
response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=[computer_tool],
    betas=["computer-use-2024-10-22"],
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Click the Submit button"},
            {"type": "image", "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": screenshot_b64,
            }},
        ],
    }],
)

# The response's tool_use content block carries the chosen action, e.g.
# {"action": "mouse_move", "coordinate": [450, 320]} followed by
# {"action": "left_click"} (the exact schema varies by tool version)
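On the client side, each returned tool call must be translated into a local input event. A sketch of that translation step, where `parse_tool_use` and the normalized action dict are illustrative (not part of the Anthropic SDK); the action names follow the computer_20241022 schema, in which clicks target the cursor position set by a prior mouse_move:

```python
def parse_tool_use(block):
    """Map an API tool_use content block (as a dict) to a normalized
    local action. Hypothetical helper for illustration only."""
    inp = block["input"]
    action = inp["action"]
    if action == "mouse_move":
        x, y = inp["coordinate"]
        return {"op": "move", "x": x, "y": y}
    if action == "left_click":
        return {"op": "click"}           # clicks at the current cursor position
    if action == "type":
        return {"op": "type", "text": inp["text"]}
    if action == "screenshot":
        return {"op": "screenshot"}      # caller captures and returns a new frame
    raise ValueError(f"unsupported action: {action}")

block = {
    "type": "tool_use",
    "name": "computer",
    "input": {"action": "mouse_move", "coordinate": [450, 320]},
}
print(parse_tool_use(block))  # {'op': 'move', 'x': 450, 'y': 320}
```

The normalized dicts would then be dispatched to an input-automation layer (e.g. a library such as pyautogui on desktop), and the resulting screenshot sent back to the model as the tool result.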

Safety Considerations

Computer use agents raise significant safety and security concerns:

Prompt injection: Malicious content rendered on screen (a webpage, email, or document) can steer the agent into actions the user never requested.

Irreversible actions: Clicks and keystrokes can delete data, send messages, or make purchases with no undo, so sensitive steps typically require human confirmation.

Credential exposure: An agent that sees the screen can see passwords, tokens, and other sensitive data captured in screenshots.
