Agent-S: Open Agentic Framework for Human-like Computer Use

Introduction

Agent-S is an innovative open-source framework from Simular AI, designed to empower AI agents to interact with computers autonomously, much like a human user. At its core, Agent-S aims to build intelligent GUI agents capable of learning from past experiences and executing complex tasks across various operating systems, including Windows, macOS, and Linux.

The framework has achieved state-of-the-art results on challenging benchmarks like OSWorld, WindowsAgentArena, and AndroidWorld, with its latest iteration, Agent S3, demonstrating performance approaching human-level accuracy. Whether you are interested in advanced AI, automation, or contributing to cutting-edge agent-based systems, Agent-S offers a robust and flexible platform.

For more details, visit the Agent-S GitHub repository.

Installation

Getting started with Agent-S is straightforward. Follow these steps to set up the framework on your machine.

Prerequisites

Single Monitor: Agent-S is optimized for single monitor setups.
Security: The agent executes Python code to control your computer, so use it with caution in trusted environments.
Supported Platforms: Agent-S supports Linux, macOS, and Windows.

Installation Steps

To install Agent S3 without cloning the repository, use pip:

pip install gui-agents

If you plan to contribute or test changes, clone the repository and, from its root directory, install in editable mode:

pip install -e .

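Additionally, pytesseract is only a Python wrapper, so the Tesseract OCR engine itself must be installed separately. The command below uses Homebrew on macOS; on other platforms, install Tesseract with the system package manager (for example, the tesseract-ocr package on Debian or Ubuntu):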

brew install tesseract
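
Because pytesseract fails at call time rather than install time when the engine is missing, it can help to check for the binary up front. The following is a minimal sketch using only the standard library; the helper name find_tesseract is hypothetical and not part of Agent-S, and it assumes the executable is named tesseract and is expected on PATH.

```python
import shutil
from typing import Optional


def find_tesseract(binary_name: str = "tesseract") -> Optional[str]:
    """Return the full path of the Tesseract executable, or None if it is not on PATH."""
    return shutil.which(binary_name)


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("Tesseract was not found on PATH; install it before running Agent-S.")
    else:
        print(f"Tesseract found at {path}")
```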

API Configuration

You need to configure your API keys for the language models. Choose one of the following methods:

Option 1: Environment Variables

Add your API keys to your shell configuration file (e.g., .bashrc or .zshrc):

export OPENAI_API_KEY=<YOUR_API_KEY>
export ANTHROPIC_API_KEY=<YOUR_ANTHROPIC_API_KEY>
export HF_TOKEN=<YOUR_HF_TOKEN>

Option 2: Python Script

Set environment variables within your Python script:

import os
os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>"
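
With either option, a quick check before launching the agent can catch a missing or empty key early. This is a sketch, not part of the Agent-S SDK: the helper missing_keys is hypothetical, and the variable names are simply the three shown in the export lines above. Which keys you actually need depends on the providers you use, so pass only the relevant names.

```python
import os
from typing import List, Mapping, Optional, Sequence

# Variable names taken from the export lines above; trim to the providers you use.
KEY_NAMES = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "HF_TOKEN")


def missing_keys(
    names: Sequence[str] = KEY_NAMES,
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return the names that are unset or contain only whitespace."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing API keys: " + ", ".join(missing))
    else:
        print("All listed API keys are set.")
```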

Agent-S supports several model providers, including Azure OpenAI, Anthropic, Gemini, OpenRouter, and vLLM inference. For best performance, UI-TARS-1.5-7B is the recommended grounding model.

Examples

Agent-S can be run via a command-line interface (CLI) or integrated into your Python projects using its SDK.

CLI Usage

The recommended setup for Agent S3 involves using OpenAI gpt-5-2025-08-07 as the main model, paired with UI-TARS-1.5-7B for grounding.

Run Agent S3 with the required parameters:

agent_s \
    --provider openai \
    --model gpt-5-2025-08-07 \
    --ground_provider huggingface \
    --ground_url http://localhost:8080 \
    --ground_model ui-tars-1.5-7b \
    --grounding_width 1920 \
    --grounding_height 1080
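
If you launch the CLI from a script, for instance to switch models between runs, the same command can be assembled programmatically. The sketch below uses only the flags shown above; the helper build_agent_s_command is hypothetical, and the endpoint and resolution values are the example values from this section, not defaults confirmed by the project.

```python
import shlex
from typing import Dict, List


def build_agent_s_command(options: Dict[str, object]) -> List[str]:
    """Turn a mapping of option names to values into an argv list for agent_s."""
    argv = ["agent_s"]
    for name, value in options.items():
        argv.append(f"--{name}")
        argv.append(str(value))
    return argv


# Values copied from the CLI example above.
options = {
    "provider": "openai",
    "model": "gpt-5-2025-08-07",
    "ground_provider": "huggingface",
    "ground_url": "http://localhost:8080",
    "ground_model": "ui-tars-1.5-7b",
    "grounding_width": 1920,
    "grounding_height": 1080,
}

command = build_agent_s_command(options)
print(shlex.join(command))  # show the command before running it

# To actually launch the agent, pass the list to subprocess:
# import subprocess
# subprocess.run(command, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems if a value ever contains spaces.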

Local Coding Environment (Optional)

For tasks requiring code execution, enable the local coding environment:

agent_s \
    --provider openai \
    --model gpt-5-2025-08-07 \
    --ground_provider huggingface \
    --ground_url http://localhost:8080 \
    --ground_model ui-tars-1.5-7b \
    --grounding_width 1920 \
    --grounding_height 1080 \
    --enable_local_env

Warning: The local coding environment executes arbitrary Python and Bash code locally. Use this feature only in trusted environments and with trusted inputs.

SDK Usage Snippet

Here's a brief example of how to use the gui_agents SDK to query the agent:

import pyautogui
import io
from gui_agents.s3.agents.agent_s import AgentS3
from gui_agents.s3.agents.grounding import OSWorldACI

# ... (engine_params and grounding_engine_params setup as per README) ...

grounding_agent = OSWorldACI(
    # ... parameters ...
)

agent = AgentS3(
    # ... parameters ...
)

# Get screenshot.
screenshot = pyautogui.screenshot()
buffered = io.BytesIO()
screenshot.save(buffered, format="PNG")
screenshot_bytes = buffered.getvalue()

obs = {
  "screenshot": screenshot_bytes,
}

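instruction = "Close VS Code"
# predict() returns agent info and the proposed actions; action[0] is Python code.
info, action = agent.predict(instruction=instruction, observation=obs)

# Run the first proposed action. This executes agent-generated code on your machine.
exec(action[0])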
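
The snippet above passes the agent's output straight to exec. While experimenting, you may prefer to review each action before it runs. The following is one possible guard, not something the SDK provides: run_with_confirmation is a hypothetical helper, and it assumes the action is a string of Python code, as the exec call above implies.

```python
from typing import Callable, Optional


def run_with_confirmation(
    action_code: str,
    confirm: Callable[[str], str] = input,
    runner: Optional[Callable[[str], object]] = None,
) -> bool:
    """Show the proposed code and run it only if the user answers yes.

    Returns True if the code was executed, False if it was skipped.
    """
    print("Agent proposes to run:\n" + action_code)
    answer = confirm("Execute this action? [y/N] ")
    if answer.strip().lower() not in ("y", "yes"):
        print("Skipped.")
        return False
    if runner is None:
        # Default: execute in this module's global namespace.
        exec(action_code, globals())
    else:
        runner(action_code)
    return True
```

In the snippet above, exec(action[0]) would then become run_with_confirmation(action[0]). The confirm and runner parameters exist mainly so the guard can be tested without a keyboard or a live desktop.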

Why Use Agent-S?

Agent-S stands out as a powerful tool for several reasons:

Human-like Computer Interaction: It enables AI agents to understand and interact with graphical user interfaces (GUIs) in a way that mimics human behavior, bridging the gap between AI and computer use.
State-of-the-Art Performance: With Agent S3, the framework achieves leading results on benchmarks like OSWorld, WindowsAgentArena, and AndroidWorld, demonstrating strong generalization capabilities.
Open and Extensible Framework: Being open-source, Agent-S provides a flexible foundation for researchers and developers to build upon, customize, and integrate into their own projects.
Multi-Platform Support: It runs seamlessly across Windows, macOS, and Linux, making it versatile for various environments.
Advanced Agentic Capabilities: Features like reflection agents and an optional local coding environment enhance the agent's ability to plan, execute, and debug complex tasks.
Flexible Model Integration: Supports a wide range of LLM providers and grounding models, allowing users to choose the best fit for their needs.