OmniParser: A Vision-Based Tool for GUI Agent Screen Parsing

Summary

OmniParser is a comprehensive tool developed by Microsoft for parsing user interface screenshots into structured, understandable elements. It significantly enhances the ability of vision-based models, such as GPT-4V, to generate accurate actions grounded in specific regions of a GUI. This project aims to advance pure vision-based GUI agents by providing robust screen parsing capabilities.

Introduction

OmniParser, developed by Microsoft, parses user interface screenshots into structured, easily interpretable elements. This lets vision-based models such as GPT-4V generate actions that are precisely grounded in the corresponding regions of a graphical user interface, a key step toward pure vision-based GUI agents capable of advanced automation and interaction with digital interfaces. Development is ongoing, including the release of V2 and of OmniTool, which lets a range of large language models control a Windows 11 VM.

Installation

To get started with OmniParser, follow these steps to clone the repository and set up your environment. Ensure you have conda installed for environment management.

First, clone the repository:

git clone https://github.com/microsoft/OmniParser.git
cd OmniParser

Next, create and activate a new conda environment, then install the required dependencies:

conda create -n "omni" python==3.12
conda activate omni
pip install -r requirements.txt

Finally, download the necessary V2 model checkpoints:

# download the model checkpoints to local directory OmniParser/weights/
for f in icon_detect/{train_args.yaml,model.pt,model.yaml} \
         icon_caption/{config.json,generation_config.json,model.safetensors}; do
    huggingface-cli download microsoft/OmniParser-v2.0 "$f" --local-dir weights
done
mv weights/icon_caption weights/icon_caption_florence
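
You can sanity-check the downloaded detection weights before moving on. The icon_detect checkpoint is a YOLO-format model, so a minimal smoke test might look like the sketch below, assuming the ultralytics package (which the repository's requirements should already provide) and a screenshot.png of your own:

# Minimal smoke test for the icon-detection checkpoint; screenshot.png
# is a placeholder for any UI screenshot you have on hand.
from ultralytics import YOLO

model = YOLO("weights/icon_detect/model.pt")
results = model.predict("screenshot.png", conf=0.05)  # low threshold for small icons

# Each box is a candidate UI element with pixel xyxy coordinates.
for box in results[0].boxes:
    print(box.xyxy.tolist(), float(box.conf))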

Examples

OmniParser provides several ways to explore its capabilities. You can find simple examples demonstrating its core functionalities within the demo.ipynb Jupyter Notebook.
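
In outline, the notebook loads the two models and then fuses OCR results with icon detections into a single structured element list. The sketch below follows that flow; the helper names and signatures come from util/utils.py at the time of writing and may change between versions, so treat it as a rough guide rather than a stable API:

# Rough outline of the demo.ipynb flow. Helper signatures are taken
# from util/utils.py and are assumptions that may drift across versions.
from util.utils import (check_ocr_box, get_caption_model_processor,
                        get_som_labeled_img, get_yolo_model)

som_model = get_yolo_model(model_path="weights/icon_detect/model.pt")
caption_model_processor = get_caption_model_processor(
    model_name="florence2",
    model_name_or_path="weights/icon_caption_florence",
)

image_path = "imgs/demo_image.jpg"  # placeholder; any screenshot works

# OCR pass: visible text plus bounding boxes in xyxy format.
(text, ocr_bbox), _ = check_ocr_box(
    image_path, display_img=False, output_bb_format="xyxy"
)

# Fuse OCR boxes and detected icons into one labeled element list.
labeled_img, coords, parsed_content_list = get_som_labeled_img(
    image_path,
    som_model,
    BOX_TRESHOLD=0.05,  # (spelled this way in the repo)
    ocr_bbox=ocr_bbox,
    ocr_text=text,
    caption_model_processor=caption_model_processor,
)
print(parsed_content_list[:3])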

For an interactive experience, a Gradio demo is available. To run it, simply execute the following command in your activated environment:

python gradio_demo.py

Additionally, you can explore the official Hugging Face Space demo for OmniParser V2 to see it in action.

Why Use OmniParser

OmniParser stands out as a critical tool for anyone working with GUI automation, AI agents, or computer vision applications. Its ability to accurately parse screen elements into a structured format makes it invaluable for:

  • Enhancing AI Agents: It empowers large language models and vision models to understand and interact with graphical user interfaces more effectively.
  • Pure Vision-Based Interaction: It moves towards a future where AI agents can operate purely based on visual input, mimicking human interaction.
  • Robust Performance: OmniParser has achieved state-of-the-art results on benchmarks such as ScreenSpot Pro and Windows Agent Arena.
  • Continuous Development: With ongoing updates like OmniParser V2, OmniTool for Windows VM control, and support for various LLMs (OpenAI, DeepSeek, Qwen, Anthropic), it remains at the forefront of GUI parsing technology.
  • Detailed Element Detection: Fine-grained icon detection and interactability prediction provide a rich, structured understanding of the UI; see the sketch after this list for how an agent might consume that output.

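To make the last point concrete, here is a small, hypothetical sketch of how a downstream agent might consume OmniParser's output. The element fields (type, bbox, interactivity, content) mirror those shown in the V2 demos; the normalized-coordinate convention and the to_click_point helper are illustrative assumptions:

# Hypothetical downstream use of OmniParser output. The element dicts
# (type / bbox / interactivity / content) mirror the V2 demo output;
# normalized xyxy coordinates are an assumption for illustration.
parsed_content_list = [
    {"type": "icon", "bbox": [0.91, 0.02, 0.95, 0.06],
     "interactivity": True, "content": "close window button"},
    {"type": "text", "bbox": [0.10, 0.12, 0.30, 0.15],
     "interactivity": False, "content": "Settings"},
]

def to_click_point(bbox, screen_w=1920, screen_h=1080):
    """Map a normalized xyxy box to a pixel click point at its center."""
    x = (bbox[0] + bbox[2]) / 2 * screen_w
    y = (bbox[1] + bbox[3]) / 2 * screen_h
    return round(x), round(y)

# An agent can restrict its action space to interactable elements only.
for el in parsed_content_list:
    if el["interactivity"]:
        print(el["content"], "->", to_click_point(el["bbox"]))
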
Links

Explore OmniParser further through these official resources:

  • GitHub repository: https://github.com/microsoft/OmniParser
  • Model checkpoints: https://huggingface.co/microsoft/OmniParser-v2.0