OmniParser: A Vision-Based Tool for GUI Agent Screen Parsing

Summary

OmniParser is a comprehensive tool developed by Microsoft for parsing user interface screenshots into structured, understandable elements. It significantly enhances the ability of vision-based models, such as GPT-4V, to generate accurate actions grounded in specific regions of a GUI. This project aims to advance pure vision-based GUI agents by providing robust screen parsing capabilities.

Introduction

OmniParser, developed by Microsoft, parses user interface screenshots into structured, easily interpretable elements. This lets vision-based models such as GPT-4V generate actions that are precisely grounded in the corresponding regions of a graphical user interface, a key step toward pure vision-based GUI agents capable of advanced automation and interaction with digital interfaces. Development is ongoing, including the release of V2 and of OmniTool, which lets a range of large language models control a Windows 11 VM.

Installation

To get started with OmniParser, follow these steps to clone the repository and set up your environment. Ensure you have conda installed for environment management.

First, clone the repository:

git clone https://github.com/microsoft/OmniParser.git
cd OmniParser

Next, create and activate a new conda environment, then install the required dependencies:

conda create -n "omni" python==3.12
conda activate omni
pip install -r requirements.txt

Finally, download the necessary V2 model checkpoints:

# download the model checkpoints to local directory OmniParser/weights/
for f in icon_detect/{train_args.yaml,model.pt,model.yaml} \
         icon_caption/{config.json,generation_config.json,model.safetensors}; do
    huggingface-cli download microsoft/OmniParser-v2.0 "$f" --local-dir weights
done
mv weights/icon_caption weights/icon_caption_florence
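
You can sanity-check the downloaded detection weights before moving on. The icon_detect checkpoint is a YOLO-format model, so a minimal smoke test might look like the sketch below, assuming the ultralytics package (which the repository's requirements should already provide) and a screenshot.png of your own:

# Minimal smoke test for the icon-detection checkpoint; screenshot.png
# is a placeholder for any UI screenshot you have on hand.
from ultralytics import YOLO

model = YOLO("weights/icon_detect/model.pt")
results = model.predict("screenshot.png", conf=0.05)  # low threshold for small icons

# Each box is a candidate UI element with pixel xyxy coordinates.
for box in results[0].boxes:
    print(box.xyxy.tolist(), float(box.conf))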

Examples

OmniParser provides several ways to explore its capabilities. You can find simple examples demonstrating its core functionalities within the demo.ipynb Jupyter Notebook.
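
In outline, the notebook loads the two models and then fuses OCR results with icon detections into a single structured element list. The sketch below follows that flow; the helper names and signatures come from util/utils.py at the time of writing and may change between versions, so treat it as a rough guide rather than a stable API:

# Rough outline of the demo.ipynb flow. Helper signatures are taken
# from util/utils.py and are assumptions that may drift across versions.
from util.utils import (check_ocr_box, get_caption_model_processor,
                        get_som_labeled_img, get_yolo_model)

som_model = get_yolo_model(model_path="weights/icon_detect/model.pt")
caption_model_processor = get_caption_model_processor(
    model_name="florence2",
    model_name_or_path="weights/icon_caption_florence",
)

image_path = "imgs/demo_image.jpg"  # placeholder; any screenshot works

# OCR pass: visible text plus bounding boxes in xyxy format.
(text, ocr_bbox), _ = check_ocr_box(
    image_path, display_img=False, output_bb_format="xyxy"
)

# Fuse OCR boxes and detected icons into one labeled element list.
labeled_img, coords, parsed_content_list = get_som_labeled_img(
    image_path,
    som_model,
    BOX_TRESHOLD=0.05,  # (spelled this way in the repo)
    ocr_bbox=ocr_bbox,
    ocr_text=text,
    caption_model_processor=caption_model_processor,
)
print(parsed_content_list[:3])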

For an interactive experience, a Gradio demo is available. To run it, simply execute the following command in your activated environment:

python gradio_demo.py

Additionally, you can explore the official Hugging Face Space demo for OmniParser V2 to see it in action.

Why Use OmniParser

OmniParser stands out as a critical tool for anyone working with GUI automation, AI agents, or computer vision applications. Its ability to accurately parse screen elements into a structured format makes it invaluable for:

  • Enhancing AI Agents: It empowers large language models and vision models to understand and interact with graphical user interfaces more effectively.
  • Pure Vision-Based Interaction: It moves towards a future where AI agents can operate purely based on visual input, mimicking human interaction.
  • Robust Performance: OmniParser has achieved state-of-the-art results on benchmarks such as ScreenSpot Pro and Windows Agent Arena.
  • Continuous Development: With ongoing updates like OmniParser V2, OmniTool for Windows VM control, and support for various LLMs (OpenAI, DeepSeek, Qwen, Anthropic), it remains at the forefront of GUI parsing technology.
  • Detailed Element Detection: Fine-grained icon detection and interactability prediction provide a rich, structured understanding of the UI; see the sketch after this list for how an agent might consume that output.

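To make the last point concrete, here is a small, hypothetical sketch of how a downstream agent might consume OmniParser's output. The element fields (type, bbox, interactivity, content) mirror those shown in the V2 demos; the normalized-coordinate convention and the to_click_point helper are illustrative assumptions:

# Hypothetical downstream use of OmniParser output. The element dicts
# (type / bbox / interactivity / content) mirror the V2 demo output;
# normalized xyxy coordinates are an assumption for illustration.
parsed_content_list = [
    {"type": "icon", "bbox": [0.91, 0.02, 0.95, 0.06],
     "interactivity": True, "content": "close window button"},
    {"type": "text", "bbox": [0.10, 0.12, 0.30, 0.15],
     "interactivity": False, "content": "Settings"},
]

def to_click_point(bbox, screen_w=1920, screen_h=1080):
    """Map a normalized xyxy box to a pixel click point at its center."""
    x = (bbox[0] + bbox[2]) / 2 * screen_w
    y = (bbox[1] + bbox[3]) / 2 * screen_h
    return round(x), round(y)

# An agent can restrict its action space to interactable elements only.
for el in parsed_content_list:
    if el["interactivity"]:
        print(el["content"], "->", to_click_point(el["bbox"]))
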
Links

Explore OmniParser further through these official resources:

  • GitHub repository: https://github.com/microsoft/OmniParser
  • Model checkpoints: https://huggingface.co/microsoft/OmniParser-v2.0