====== MobileAgent ======

**MobileAgent** is an open-source family of autonomous GUI agents developed by X-PLUG (Alibaba/Tongyi Lab) for operating mobile devices, desktops, and web interfaces through visual perception.(([[https://github.com/X-PLUG/MobileAgent|MobileAgent on GitHub]])) It uses multi-modal language models to see, understand, and interact with graphical interfaces without relying on app-specific APIs or system metadata.

{{tag>ai_agent gui mobile visual_perception multi_agent alibaba open_source}}

| **Repository** | [[https://github.com/X-PLUG/MobileAgent]] |
| **Website** | [[https://x-plug.github.io/MobileAgent/]](([[https://x-plug.github.io/MobileAgent/|MobileAgent Project Website]])) |
| **Language** | Python |
| **License** | Open Source |
| **Creator** | X-PLUG / Alibaba Tongyi Lab |
| **Key Authors** | Junyang Wang, Haiyang Xu, Ming Yan |

===== Overview =====

MobileAgent tackles the "last mile" problem of AI assistants: enabling them to actually operate graphical user interfaces the way humans do. Rather than requiring app APIs or system metadata, MobileAgent perceives screens through screenshots, reasons about UI elements, and executes actions (taps, swipes, text input) autonomously.

The project has evolved through multiple versions, each advancing speed, efficiency, and cross-platform support.

===== Version History =====

==== Mobile-Agent v1 ====

The original autonomous multi-modal agent, built on pure visual perception. Because it is independent of XML files and system metadata, it enables unrestricted multi-app operation across diverse mobile environments.(([[https://arxiv.org/html/2401.16158v1|Mobile-Agent v1 Paper (arXiv)]]))

==== Mobile-Agent v2 ====

Introduced multi-agent collaboration for effective navigation in long-context scenarios, with improved visual perception accuracy and GPT-4o support. Accepted at **NeurIPS 2024**.
==== Mobile-Agent v3 ====

Focused on practical deployment: 10-15 seconds per operation and roughly 8 GB of memory using 2B-parameter open-source models, with cross-platform support (mobile, web, PC). Won **Best Demo at CCL 2024**. Released alongside GUI-Owl.(([[https://arxiv.org/html/2508.15144v2|Mobile-Agent-v3 Paper (arXiv)]]))

==== Mobile-Agent-E ====

Hierarchical multi-agent architecture with self-evolution from past experiences. Designed for complex, long-horizon tasks that require learning and adapting over time.

==== PC-Agent ====

Extends the framework to desktop environments (Mac and Windows) with hierarchical task automation.

===== GUI-Owl =====

GUI-Owl is a foundational GUI agent model released with v3. It achieves state-of-the-art performance among open-source end-to-end models across ten GUI benchmarks spanning desktop and mobile; GUI-Owl-7B scores 66.4 on AndroidWorld and 29.4 on OSWorld-Verified. Key innovations include cloud-based virtual environment infrastructure and a Self-Evolving GUI Trajectory Production framework.
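Across every version, the core loop is the same: capture a screenshot, let the model ground UI elements, then execute a primitive action over ADB. Below is a minimal Python sketch of those device primitives; the function names are illustrative rather than taken from the MobileAgent codebase, while the underlying ''adb'' subcommands (''exec-out screencap -p'', ''shell input tap/swipe/text'') are standard Android tooling.

```python
import base64
import subprocess

def _adb(serial=None):
    """Base adb invocation, optionally targeting a specific device."""
    return ["adb"] + (["-s", serial] if serial else [])

def screencap_cmd(serial=None):
    """Command that writes the current screen as a PNG to stdout."""
    return _adb(serial) + ["exec-out", "screencap", "-p"]

def tap_cmd(x, y, serial=None):
    return _adb(serial) + ["shell", "input", "tap", str(x), str(y)]

def swipe_cmd(x1, y1, x2, y2, duration_ms=300, serial=None):
    return _adb(serial) + ["shell", "input", "swipe",
                           str(x1), str(y1), str(x2), str(y2), str(duration_ms)]

def text_cmd(text, serial=None):
    # 'input text' does not accept literal spaces; %s is its space escape.
    return _adb(serial) + ["shell", "input", "text", text.replace(" ", "%s")]

def perceive(serial=None):
    """Capture a screenshot, base64-encoded for a multi-modal model backend."""
    png = subprocess.run(screencap_cmd(serial),
                         capture_output=True, check=True).stdout
    return base64.b64encode(png).decode("ascii")
```

An agent loop would then alternate ''perceive()'' with whichever action the model backend selects; the actual framework layers element grounding and planning on top of primitives like these.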
===== Architecture =====

<code>
graph TD
    A[User Task] --> B[MobileAgent Framework]
    B --> C{Version / Mode}
    C --> D[v1: Single Agent]
    C --> E[v2: Multi-Agent Collab]
    C --> F[v3: Cross-Platform]
    C --> G[Agent-E: Self-Evolving]
    C --> H[PC-Agent: Desktop]
    D --> I[Visual Perception]
    E --> I
    F --> I
    G --> I
    H --> I
    I --> J[Screenshot Analysis]
    J --> K[Element Grounding]
    K --> L[Action Planning]
    L --> M{Execution}
    M --> N[Tap / Click]
    M --> O[Swipe / Scroll]
    M --> P[Text Input]
    M --> Q[App Navigation]
    I --> R{Model Backend}
    R --> S[GUI-Owl 7B/2B]
    R --> T[Qwen-VL]
    R --> U[GPT-4o / Claude]
    R --> V[Gemini]
    B --> W[ADB Connection]
    W --> X[Android / HarmonyOS]
    B --> Y[Desktop Control]
    Y --> Z[Mac / Windows]
</code>

===== Supported Platforms =====

  * **Mobile** -- Android, HarmonyOS (version 4 and below) via ADB
  * **Desktop** -- Mac, Windows (via PC-Agent)
  * **Web** -- Cross-platform web automation (v3)
  * **Note** -- iOS is not currently supported

===== How It Works =====

  - Connect the device via **ADB** (enable USB debugging and file transfer)
  - Configure the backbone model (OpenAI, Gemini, Claude, or local Qwen-VL) with API keys
  - Run task scripts: ''run_task.sh'' for single tasks, ''run_tasks_evolution.sh'' for self-evolving sequences
  - The agent perceives the screen via screenshots, reasons about UI elements, and executes actions
  - Online demos are available on Hugging Face and ModelScope for screenshot-based testing

===== Installation =====

<code bash>
# Clone the repository
git clone https://github.com/X-PLUG/MobileAgent
cd MobileAgent

# Install dependencies
pip install -r requirements.txt

# Configure API keys and backbone model
# Edit config files for your preferred model

# Run a task (Android via ADB)
bash run_task.sh

# Run self-evolving task sequence
bash run_tasks_evolution.sh
</code>

===== See Also =====

  * [[cheshire_cat]] -- AI agent microservice framework
  * [[claude_code]] -- Anthropic Claude Code CLI agent
  * [[github_copilot]] -- GitHub Copilot ecosystem

===== References =====