Core Concepts
Reasoning
Memory & Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools
Safety & Security
Evaluation
Meta
MobileAgent is an open-source family of autonomous GUI agents developed by X-PLUG (Alibaba/Tongyi Lab) for operating mobile devices, desktops, and web interfaces through visual perception.1) It uses multi-modal language models to see, understand, and interact with graphical interfaces without relying on app-specific APIs or system metadata.
ai_agent gui mobile visual_perception multi_agent alibaba open_source
| Repository | https://github.com/X-PLUG/MobileAgent |
| Website | https://x-plug.github.io/MobileAgent/ |
| Language | Python |
| License | Open Source |
| Creator | X-PLUG / Alibaba Tongyi Lab |
| Key Authors | Junyang Wang, Haiyang Xu, Ming Yan |
MobileAgent tackles the “last mile” problem of AI assistants: enabling them to actually operate graphical user interfaces the way humans do. Rather than requiring app APIs or system metadata, MobileAgent perceives screens through screenshots, reasons about UI elements, and executes actions (taps, swipes, text input) autonomously. The project has evolved through multiple versions, each advancing capabilities in speed, efficiency, and cross-platform support.
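The perceive-reason-act loop described above can be sketched in a few lines. This is a hypothetical simplification, not MobileAgent's actual code: `capture_screenshot` and the action grammar (`tap(x, y)`, `swipe(...)`, `type(...)`) are assumptions for illustration, and the multi-modal model call is left abstract.

```python
import re
import subprocess

def capture_screenshot(path: str = "screen.png") -> str:
    """Pull the current screen from an Android device over ADB (assumed setup)."""
    with open(path, "wb") as f:
        subprocess.run(["adb", "exec-out", "screencap", "-p"], stdout=f, check=True)
    return path

def action_to_adb(action: str) -> list[str]:
    """Translate a model-emitted action string into an ADB input command.

    The action grammar here is a hypothetical stand-in for whatever format
    the backbone model is prompted to emit.
    """
    if m := re.fullmatch(r"tap\((\d+),\s*(\d+)\)", action):
        return ["adb", "shell", "input", "tap", m[1], m[2]]
    if m := re.fullmatch(r"swipe\((\d+),\s*(\d+),\s*(\d+),\s*(\d+)\)", action):
        return ["adb", "shell", "input", "swipe", *m.groups()]
    if m := re.fullmatch(r"type\((.+)\)", action):
        return ["adb", "shell", "input", "text", m[1]]
    raise ValueError(f"unrecognized action: {action}")
```

In the real system, the screenshot would be sent to the multi-modal model along with the task description, and the returned action string would be parsed and executed in a loop until the task completes.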
The original autonomous multi-modal agent using pure visual perception. It operates independently of XML files and system metadata, enabling unrestricted multi-app operation across diverse mobile environments.2)
Introduced multi-agent collaboration for effective navigation in long-context scenarios, with enhanced visual perception accuracy and GPT-4o support. Accepted at NeurIPS 2024.
Focused on practical deployment: 10-15 seconds per operation, 8GB memory using 2B open-source models, and cross-platform support (mobile, web, PC). Won Best Demo at CCL 2024. Released alongside GUI-Owl.3)
Hierarchical multi-agent architecture with self-evolution from past experiences. Designed for complex, long-horizon tasks that require learning and adapting over time.
Extended the framework to desktop environments (Mac and Windows) with hierarchical task automation.
GUI-Owl is a foundational GUI agent model released with v3. It achieves state-of-the-art performance among open-source end-to-end models across ten GUI benchmarks spanning desktop and mobile. GUI-Owl-7B scores 66.4 on AndroidWorld and 29.4 on OSWorld-Verified. Key innovations include cloud-based virtual environment infrastructure and a Self-Evolving GUI Trajectory Production framework.
Two helper scripts drive execution: `run_task.sh` for single tasks, and `run_tasks_evolution.sh` for self-evolving task sequences.

```
# Clone the repository
git clone https://github.com/X-PLUG/MobileAgent
cd MobileAgent

# Install dependencies
pip install -r requirements.txt

# Configure API keys and backbone model
# Edit config files for your preferred model

# Run a task (Android via ADB)
bash run_task.sh

# Run self-evolving task sequence
bash run_tasks_evolution.sh
```