This is an old revision of the document!

MobileAgent

MobileAgent is an open-source family of autonomous GUI agents developed by X-PLUG (Alibaba/Tongyi Lab) for operating mobile devices, desktops, and web interfaces through visual perception.¹⁾ It uses multi-modal language models to see, understand, and interact with graphical interfaces without relying on app-specific APIs or system metadata.

ai_agent gui mobile visual_perception multi_agent alibaba open_source

Repository	https://github.com/X-PLUG/MobileAgent
Website	https://x-plug.github.io/MobileAgent/
Language	Python
License	Open Source
Creator	X-PLUG / Alibaba Tongyi Lab
Key Authors	Junyang Wang, Haiyang Xu, Ming Yan

Overview

MobileAgent tackles the “last mile” problem of AI assistants: enabling them to actually operate graphical user interfaces the way humans do. Rather than requiring app APIs or system metadata, MobileAgent perceives screens through screenshots, reasons about UI elements, and executes actions (taps, swipes, text input) autonomously. The project has evolved through multiple versions, each advancing capabilities in speed, efficiency, and cross-platform support.

Version History

Mobile-Agent v1

The original autonomous multi-modal agent using pure visual perception. Independent of XML files or system metadata, enabling unrestricted multi-app operations across diverse mobile environments.²⁾

Mobile-Agent v2

Introduced multi-agent collaboration for effective navigation in long-context scenarios. Enhanced visual perception for accuracy with GPT-4o support. Accepted at NeurIPS 2024.

Mobile-Agent v3

Focused on practical deployment: 10-15 seconds per operation, 8GB memory using 2B open-source models, and cross-platform support (mobile, web, PC). Won Best Demo at CCL 2024. Released alongside GUI-Owl.³⁾

Mobile-Agent-E

Hierarchical multi-agent architecture with self-evolution from past experiences. Designed for complex, long-horizon tasks that require learning and adapting over time.

PC-Agent

Extended the framework to desktop environments (Mac and Windows) with hierarchical task automation.

GUI-Owl

GUI-Owl is a foundational GUI agent model released with v3. It achieves state-of-the-art performance among open-source end-to-end models across ten GUI benchmarks spanning desktop and mobile. GUI-Owl-7B scores 66.4 on AndroidWorld and 29.4 on OSWorld-Verified. Key innovations include cloud-based virtual environment infrastructure and a Self-Evolving GUI Trajectory Production framework.

Architecture

graph TD A[User Task] --> B[MobileAgent Framework] B --> C{Version / Mode} C --> D[v1: Single Agent] C --> E[v2: Multi-Agent Collab] C --> F[v3: Cross-Platform] C --> G[Agent-E: Self-Evolving] C --> H[PC-Agent: Desktop] D --> I[Visual Perception] E --> I F --> I G --> I H --> I I --> J[Screenshot Analysis] J --> K[Element Grounding] K --> L[Action Planning] L --> M{Execution} M --> N[Tap / Click] M --> O[Swipe / Scroll] M --> P[Text Input] M --> Q[App Navigation] I --> R{Model Backend} R --> S[GUI-Owl 7B/2B] R --> T[Qwen-VL] R --> U[GPT-4o / Claude] R --> V[Gemini] B --> W[ADB Connection] W --> X[Android / HarmonyOS] B --> Y[Desktop Control] Y --> Z[Mac / Windows]

Supported Platforms

Mobile – Android, HarmonyOS (version 4 and below) via ADB
Desktop – Mac, Windows (via PC-Agent)
Web – Cross-platform web automation (v3)
Note – iOS is not currently supported

How It Works

Connect device via ADB (enable USB debugging and file transfer)
Configure backbone model (OpenAI, Gemini, Claude, or local Qwen-VL) with API keys
Run task scripts: run_task.sh for single tasks, run_tasks_evolution.sh for self-evolving sequences
Agent perceives screen via screenshots, reasons about UI elements, and executes actions
Online demos available on Hugging Face and ModelScope for screenshot-based testing

Installation

# Clone the repository
git clone https://github.com/X-PLUG/MobileAgent
cd MobileAgent
 
# Install dependencies
pip install -r requirements.txt
 
# Configure API keys and backbone model
# Edit config files for your preferred model
 
# Run a task (Android via ADB)
bash run_task.sh
 
# Run self-evolving task sequence
bash run_tasks_evolution.sh

AI Agent Knowledge Base

Sidebar

Table of Contents

MobileAgent

Overview

Version History

Mobile-Agent v1

Mobile-Agent v2

Mobile-Agent v3

Mobile-Agent-E

PC-Agent

GUI-Owl

Architecture

Supported Platforms

How It Works

Installation

References

See Also

AI Agent Knowledge Base

User Tools

Site Tools

Sidebar

Table of Contents

MobileAgent

Overview

Version History

Mobile-Agent v1

Mobile-Agent v2

Mobile-Agent v3

Mobile-Agent-E

PC-Agent

GUI-Owl

Architecture

Supported Platforms

How It Works

Installation

References

See Also

Page Tools