GUI grounding agents are AI systems that interact with graphical user interfaces through visual understanding, enabling them to locate UI elements on screen and perform actions like clicking, typing, and scrolling based on natural language instructions. Rather than relying on structured APIs or accessibility trees, these agents parse raw screenshots to understand and manipulate interfaces.
The challenge of GUI grounding lies in bridging the gap between natural language intent and precise pixel-level interaction. A user might say “click the submit button” and the agent must visually identify the button's location on screen and generate the correct coordinates. This requires visual perception, spatial reasoning, and an understanding of UI conventions across platforms (web, desktop, mobile).
Recent advances in vision-language models (VLMs) have made pure-vision approaches competitive with or superior to methods that rely on HTML parsing or accessibility trees, opening the path to truly universal GUI automation.
CogAgent is a multimodal foundation model designed specifically for GUI understanding. It processes high-resolution screenshots and supports both screen parsing and action prediction. CogAgent serves as a base architecture that other GUI agents build upon, demonstrating that visual-only approaches can effectively ground UI elements.
WebVoyager focuses on web-based GUI automation, using hierarchical planning to decompose complex web tasks into sequences of grounded actions. It excels at multi-step browser interactions by maintaining a plan structure while adapting to dynamic page content.
Agent S2 introduces a Mixture-of-Grounding architecture combining generalist and specialist grounding models with Proactive Hierarchical Planning. This compositional framework achieves state-of-the-art results on OSWorld, WindowsAgentArena, and AndroidWorld benchmarks by adapting its grounding strategy to different UI contexts.
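The routing idea behind a Mixture-of-Grounding setup can be sketched in a few lines. The grounder signatures, context keys, and stub coordinates below are hypothetical illustrations, not Agent S2's actual API:

```python
from typing import Callable, Dict, Tuple

Point = Tuple[int, int]
Grounder = Callable[[str, bytes], Point]  # (instruction, screenshot) -> click point

def make_router(generalist: Grounder, specialists: Dict[str, Grounder]):
    """Dispatch a grounding request to a specialist model when the UI
    context matches one, otherwise fall back to the generalist."""
    def ground(instruction: str, screenshot: bytes, context: str) -> Point:
        grounder = specialists.get(context, generalist)
        return grounder(instruction, screenshot)
    return ground

# Usage with stub grounders standing in for real models
router = make_router(
    generalist=lambda ins, img: (100, 100),
    specialists={"terminal": lambda ins, img: (10, 20)},
)
print(router("click prompt", b"", "terminal"))  # (10, 20)
print(router("click button", b"", "web"))       # (100, 100)
```

The point of the composition is that the planner never needs to know which grounder answered; it just receives a click point.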
SE-GUI applies self-evolutionary reinforcement learning fine-tuning with dense policy gradients. Starting from Qwen2.5-VL-7B with only 3,000 seed samples, SE-GUI achieves remarkable results — 88.2% on ScreenSpot and 47.3% on ScreenSpot-Pro, beating the much larger UI-TARS-72B model by 24.2% on the professional benchmark.
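One way to picture a "dense" grounding reward: instead of a binary hit/miss signal, the reward decays smoothly with distance from the target element, so near-misses still provide gradient. This shaping function is an illustrative assumption, not SE-GUI's published formulation:

```python
import math

def dense_grounding_reward(pred, bbox):
    """Continuous reward for a predicted click point.

    Returns 1.0 inside the target bounding box; outside, decays with
    the distance to the box center, normalized by the box diagonal
    (illustrative shaping, not SE-GUI's exact reward).
    """
    x, y = pred
    x1, y1, x2, y2 = bbox
    if x1 <= x <= x2 and y1 <= y <= y2:
        return 1.0
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    diag = math.hypot(x2 - x1, y2 - y1)
    dist = math.hypot(x - cx, y - cy)
    return max(0.0, 1.0 - dist / (2 * diag))

print(dense_grounding_reward((50, 50), (0, 0, 100, 100)))  # 1.0
```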
AgentCPM-GUI uses a three-stage training pipeline: grounding pre-training on 12 million samples, supervised fine-tuning on 55,000 trajectories, and GRPO reinforcement learning. Focused on mobile/Android environments, it achieves 96.9% type-match accuracy on the CAGUI Chinese GUI benchmark.
Visual grounding maps natural language instructions to screen coordinates. The agent must identify which UI element corresponds to the instruction and predict a precise click point. Methods range from direct coordinate regression to attention-based region proposals.
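In the direct-regression style, the VLM typically emits coordinates as text that the harness then parses and scales. A minimal sketch, assuming the model emits normalized `click(x, y)` strings (the output format here is an assumption, not a standard):

```python
import re

def parse_click(output: str, width: int, height: int):
    """Parse a model response like 'click(0.42, 0.17)' with normalized
    coordinates and convert it to pixel coordinates (format assumed)."""
    m = re.search(r"\(([\d.]+)\s*,\s*([\d.]+)\)", output)
    if not m:
        return None  # model did not produce a parseable coordinate
    nx, ny = float(m.group(1)), float(m.group(2))
    return round(nx * width), round(ny * height)

print(parse_click("click(0.42, 0.17)", 1920, 1080))  # (806, 184)
```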
Agents analyze raw screenshots to build structured representations of UI elements without access to the DOM or accessibility APIs. Techniques include edge detection, Information-Sensitive Cropping (ISC), and learned element detectors.
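Cropping logic of this kind ultimately reduces to coordinate arithmetic. Below is a simplified stand-in for Information-Sensitive Cropping (the real technique's selection criteria are not detailed here) that centers a fixed-size window on a candidate point and clamps it to the screenshot bounds:

```python
def crop_region(width, height, center, crop_w, crop_h):
    """Compute a crop window of size crop_w x crop_h centered on a
    candidate point, shifted as needed to stay inside the screenshot.
    Returns (left, top, right, bottom) in pixel coordinates."""
    cx, cy = center
    left = min(max(cx - crop_w // 2, 0), width - crop_w)
    top = min(max(cy - crop_h // 2, 0), height - crop_h)
    return left, top, left + crop_w, top + crop_h

# A point near the bottom-left corner: the window is clamped, not clipped.
print(crop_region(1920, 1080, (30, 1070), 400, 300))  # (0, 780, 400, 1080)
```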
Given a grounded element, the agent predicts the appropriate action (click, type, scroll, drag) and its parameters. JSON-structured action spaces provide a clean interface between the vision model and the execution environment.
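A minimal version of such a JSON action interface might look like this; the action names and required-parameter sets below are a hypothetical schema, not any specific agent's:

```python
import json

# Hypothetical action space: each action type and its required parameters
ACTION_PARAMS = {
    "click": {"x", "y"},
    "type": {"text"},
    "scroll": {"dx", "dy"},
    "drag": {"x1", "y1", "x2", "y2"},
}

def validate_action(raw: str) -> dict:
    """Parse a model-emitted JSON action and check it against the schema
    before handing it to the execution environment."""
    action = json.loads(raw)
    kind = action.get("action")
    if kind not in ACTION_PARAMS:
        raise ValueError(f"unknown action: {kind!r}")
    missing = ACTION_PARAMS[kind] - action.keys()
    if missing:
        raise ValueError(f"missing parameters: {sorted(missing)}")
    return action

print(validate_action('{"action": "click", "x": 640, "y": 360}'))
```

Validating at this boundary keeps malformed model output from reaching the executor, which is the main benefit of a structured action space.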
RegionFocus introduces test-time scaling for GUI grounding by extracting sub-regions around uncertain predictions and re-analyzing them at higher resolution. Applied to 72B-parameter models, this iterative refinement achieves 61.6% on ScreenSpot-Pro, the current state of the art.
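The coordinate bookkeeping behind this kind of refinement is mapping a click predicted inside the zoomed crop back to full-screen coordinates. A sketch, with all function and variable names assumed:

```python
def to_global(region_origin, region_size, zoom_size, local_pred):
    """Map a click predicted in a zoomed-in crop back to global screen
    coordinates (illustrative arithmetic for RegionFocus-style
    refinement; names assumed)."""
    left, top = region_origin  # crop's top-left corner on the full screen
    lw, lh = region_size       # crop size in original screenshot pixels
    zw, zh = zoom_size         # resolution the crop was re-analyzed at
    lx, ly = local_pred        # click predicted inside the zoomed crop
    return left + round(lx * lw / zw), top + round(ly * lh / zh)

# A click at (512, 384) in a 1024x768 re-render of the 400x300 crop at (800, 200):
print(to_global((800, 200), (400, 300), (1024, 768), (512, 384)))  # (1000, 350)
```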
```python
# Key GUI grounding benchmarks and representative scores
benchmarks = {
    "ScreenSpot": {"SE-GUI-7B": 88.2, "description": "Multi-platform element grounding"},
    "ScreenSpot-v2": {"SE-GUI-7B": 90.25, "description": "Updated multi-platform grounding"},
    "ScreenSpot-Pro": {
        "RegionFocus": 61.6,
        "SE-GUI-7B": 47.3,
        "description": "Professional apps, 23 apps, 5 domains, 3 OS",
    },
    "OSWorld": {"Agent-S2": "SOTA", "description": "Desktop OS task completion"},
    "AndroidWorld": {"Agent-S2": "SOTA", "description": "Mobile task completion"},
    "CAGUI": {"AgentCPM-GUI": 96.9, "description": "Chinese GUI type-match"},
}
```
The dominant architecture for GUI grounding agents follows this pipeline: