GUI Grounding Agents

GUI grounding agents are AI systems that interact with graphical user interfaces through visual understanding, enabling them to locate UI elements on screen and perform actions like clicking, typing, and scrolling based on natural language instructions. Rather than relying on structured APIs or accessibility trees, these agents parse raw screenshots to understand and manipulate interfaces.

Overview

The challenge of GUI grounding lies in bridging the gap between natural language intent and precise pixel-level interaction. A user might say “click the submit button” and the agent must visually identify the button's location on screen and generate the correct coordinates. This requires visual perception, spatial reasoning, and an understanding of UI conventions across platforms (web, desktop, mobile).

Recent advances in vision-language models (VLMs) have made pure-vision approaches competitive with or superior to methods that rely on HTML parsing or accessibility trees, opening the path to truly universal GUI automation.

Key Systems

CogAgent

CogAgent is a multimodal foundation model designed specifically for GUI understanding. It processes high-resolution screenshots and supports both screen parsing and action prediction. CogAgent serves as a base architecture that other GUI agents build upon, demonstrating that visual-only approaches can effectively ground UI elements.

WebVoyager

WebVoyager focuses on web-based GUI automation, using hierarchical planning to decompose complex web tasks into sequences of grounded actions. It excels at multi-step browser interactions by maintaining a plan structure while adapting to dynamic page content.

Agent S2

Agent S2 introduces a Mixture-of-Grounding architecture combining generalist and specialist grounding models with Proactive Hierarchical Planning. This compositional framework achieves state-of-the-art results on OSWorld, WindowsAgentArena, and AndroidWorld benchmarks by adapting its grounding strategy to different UI contexts.

SE-GUI

SE-GUI applies self-evolutionary reinforcement learning fine-tuning with dense policy gradients. Starting from Qwen2.5-VL-7B with only 3,000 seed samples, SE-GUI reaches 88.2% on ScreenSpot and 47.3% on ScreenSpot-Pro, beating the much larger UI-TARS-72B model by 24.2% on the professional benchmark.

AgentCPM-GUI

AgentCPM-GUI uses a three-stage training pipeline: grounding pre-training on 12 million samples, supervised fine-tuning on 55,000 trajectories, and GRPO reinforcement learning. Focused on mobile/Android environments, it achieves 96.9% type-match accuracy on the CAGUI Chinese GUI benchmark.

Core Techniques

Visual Grounding

Visual grounding maps natural language instructions to screen coordinates. The agent must identify which UI element corresponds to the instruction and predict a precise click point. Methods range from direct coordinate regression to attention-based region proposals.
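For the direct-regression style of grounding, models typically emit coordinates as text that must be converted to screen pixels. The sketch below assumes a hypothetical output format with normalized `(x, y)` coordinates in [0, 1]; real models vary in how they serialize coordinates.

```python
import re

def parse_click_point(model_output: str, screen_w: int, screen_h: int):
    """Convert a model's normalized '(x, y)' coordinate string to pixels.

    Assumes the grounding model emits coordinates in [0, 1]; the exact
    output format is a hypothetical choice for this sketch.
    """
    match = re.search(r"\(\s*([01]?\.\d+)\s*,\s*([01]?\.\d+)\s*\)", model_output)
    if match is None:
        raise ValueError(f"no coordinate found in: {model_output!r}")
    x, y = float(match.group(1)), float(match.group(2))
    return round(x * screen_w), round(y * screen_h)

print(parse_click_point("click the submit button at (0.42, 0.17)", 1920, 1080))
# → (806, 184)
```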

Screen Parsing

Agents analyze raw screenshots to build structured representations of UI elements without access to the DOM or accessibility APIs. Techniques include edge detection, Information-Sensitive Cropping (ISC), and learned element detectors.
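As a toy illustration of element detection, the sketch below finds bounding boxes of connected foreground regions in a binary mask, a deliberately simplified stand-in for the learned detectors real systems use; the grid input is hypothetical.

```python
from collections import deque

def detect_elements(mask):
    """Return axis-aligned bounding boxes of connected foreground regions.

    Toy stand-in for a learned element detector: `mask` is a 2D grid
    where 1 marks pixels belonging to some UI element.
    """
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for sy in range(h):
        for sx in range(w):
            if mask[sy][sx] and not seen[sy][sx]:
                # BFS flood fill to collect one connected component
                q = deque([(sy, sx)])
                seen[sy][sx] = True
                x0 = x1 = sx
                y0 = y1 = sy
                while q:
                    y, x = q.popleft()
                    x0, x1 = min(x0, x), max(x1, x)
                    y0, y1 = min(y0, y), max(y1, y)
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            q.append((ny, nx))
                boxes.append((x0, y0, x1, y1))  # (left, top, right, bottom)
    return boxes

screen = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
print(detect_elements(screen))  # → [(0, 0, 1, 1), (3, 1, 4, 2)]
```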

Action Prediction

Given a grounded element, the agent predicts the appropriate action (click, type, scroll, drag) and its parameters. JSON-structured action spaces provide a clean interface between the vision model and the execution environment.
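A minimal sketch of such a JSON action interface is below. The action names and parameter sets are hypothetical; each system defines its own schema.

```python
import json

# Hypothetical action space; real agents define their own schemas.
ACTION_PARAMS = {
    "click":  {"x", "y"},
    "type":   {"text"},
    "scroll": {"direction", "amount"},
    "drag":   {"x0", "y0", "x1", "y1"},
}

def validate_action(raw: str) -> dict:
    """Parse a model-emitted JSON action and check its required parameters."""
    action = json.loads(raw)
    name = action.get("action")
    required = ACTION_PARAMS.get(name)
    if required is None:
        raise ValueError(f"unknown action: {name!r}")
    missing = required - action.keys()
    if missing:
        raise ValueError(f"{name} missing parameters: {sorted(missing)}")
    return action

print(validate_action('{"action": "click", "x": 806, "y": 184}'))
# → {'action': 'click', 'x': 806, 'y': 184}
```

Validating before execution lets the environment reject malformed model output instead of performing a wrong or partial action.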

Test-Time Scaling

RegionFocus introduces test-time scaling for GUI grounding by extracting sub-regions around uncertain predictions and re-analyzing them at higher resolution. Applied on top of 72B-parameter models, this iterative refinement achieves 61.6% on ScreenSpot-Pro, the current state of the art.
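The cropping geometry behind this idea can be sketched as follows. The window size (1/zoom of the screen) and the clamping strategy are assumptions for illustration, not the published RegionFocus parameters.

```python
def focus_region(x, y, screen_w, screen_h, zoom=3):
    """Compute a sub-region around an uncertain prediction for re-analysis.

    Crops a window 1/zoom of the screen, centred on (x, y) and clamped to
    stay inside the screenshot; the model then re-grounds within this crop
    at effectively higher resolution. Window size and clamping are
    illustrative assumptions.
    """
    crop_w, crop_h = screen_w // zoom, screen_h // zoom
    left = min(max(x - crop_w // 2, 0), screen_w - crop_w)
    top = min(max(y - crop_h // 2, 0), screen_h - crop_h)
    return left, top, left + crop_w, top + crop_h

def to_screen(cx, cy, crop):
    """Map coordinates predicted inside the crop back to full-screen space."""
    left, top, _, _ = crop
    return left + cx, top + cy

crop = focus_region(806, 184, 1920, 1080)
print(crop)                    # → (486, 4, 1126, 364)
print(to_screen(100, 50, crop))  # → (586, 54)
```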

Benchmarks

# Key GUI grounding benchmarks and representative scores
benchmarks = {
    "ScreenSpot":     {"SE-GUI-7B": 88.2, "description": "Multi-platform element grounding"},
    "ScreenSpot-v2":  {"SE-GUI-7B": 90.25, "description": "Updated multi-platform grounding"},
    "ScreenSpot-Pro": {"RegionFocus": 61.6, "SE-GUI-7B": 47.3,
                       "description": "23 professional apps across 5 domains and 3 operating systems"},
    "OSWorld":        {"Agent-S2": "SOTA", "description": "Desktop OS task completion"},
    "AndroidWorld":   {"Agent-S2": "SOTA", "description": "Mobile task completion"},
    "CAGUI":          {"AgentCPM-GUI": 96.9, "description": "Chinese GUI type-match"},
}

Architecture Patterns

The dominant architecture for GUI grounding agents follows this pipeline:

  1. Screenshot encoding — A vision encoder (e.g., ViT or SigLIP) processes the screen image
  2. Instruction fusion — Natural language instructions are fused with visual features via cross-attention
  3. Element grounding — The model predicts bounding boxes or click coordinates for target elements
  4. Action generation — An action decoder produces structured actions (click, type, scroll) with parameters
  5. Feedback loop — The new screen state is captured and fed back for multi-step interaction
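The five stages above can be sketched as a control loop. The `ToyModel` and `ToyEnv` stubs below are hypothetical stand-ins for a grounding VLM and an execution environment, included only so the loop structure is concrete.

```python
def run_episode(model, env, instruction, max_steps=10):
    """Drive the screenshot -> ground -> act -> observe loop until done."""
    for _ in range(max_steps):
        screenshot = env.capture()                       # 1. capture screen state
        action = model.predict(screenshot, instruction)  # 2-4. fuse, ground, decode
        if action["action"] == "done":
            return True
        env.execute(action)                              # 5. feedback: new state next turn
    return False

class ToyEnv:
    """Stub environment that records executed actions."""
    def __init__(self):
        self.log = []
    def capture(self):
        return "screenshot-bytes"
    def execute(self, action):
        self.log.append(action)

class ToyModel:
    """Stub model that clicks once, then reports done."""
    def __init__(self):
        self.step = 0
    def predict(self, screenshot, instruction):
        self.step += 1
        if self.step == 1:
            return {"action": "click", "x": 10, "y": 20}
        return {"action": "done"}

env = ToyEnv()
done = run_episode(ToyModel(), env, "click the button")
print(done, env.log)  # → True [{'action': 'click', 'x': 10, 'y': 20}]
```

The `max_steps` cap is a common safeguard: it bounds runaway episodes when the model never emits a terminal action.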
