Skip to content

openadapt-evals

GitHub

Auto-generated from OpenAdaptAI/openadapt-evals. Last synced: 2026-03-04 01:23 UTC


OpenAdapt Evals

Tests Build PyPI Python 3.10+ License: MIT

Evaluation infrastructure for GUI agent benchmarks, built for OpenAdapt.

What is OpenAdapt Evals?

OpenAdapt Evals is a unified framework for evaluating GUI automation agents against standardized benchmarks such as Windows Agent Arena (WAA). It provides benchmark adapters, agent interfaces, cloud VM infrastructure (Azure and AWS) for parallel evaluation, and result visualization -- everything needed to go from "I have a GUI agent" to "here are its benchmark scores."

Benchmark Viewer

Benchmark Viewer Animation

More screenshots **Task Detail View** -- step-by-step replay with screenshots, actions, and execution logs: ![Task Detail View](https://raw.githubusercontent.com/OpenAdaptAI/openadapt-evals/main/docs/screenshots/desktop_task_detail.png) **Cost Tracking Dashboard** -- real-time VM cost monitoring with tiered sizing and spot instances: ![Cost Dashboard](https://raw.githubusercontent.com/OpenAdaptAI/openadapt-evals/main/screenshots/cost_dashboard_preview.png)

Key Features

  • Benchmark adapters for WAA (live, mock, and local modes), with an extensible base for OSWorld, WebArena, and others
  • Task setup handlers -- verify_apps and install_apps ensure required applications are present on the Windows VM before evaluation begins
  • Agent interfaces including ApiAgent (Claude / GPT), ClaudeComputerUseAgent (with coordinate clamping and fail-safe recovery), RetrievalAugmentedAgent, RandomAgent, and PolicyAgent
  • Multi-cloud VM infrastructure with AzureVMManager, AWSVMManager, PoolManager, SSHTunnelManager, and VMMonitor for running evaluations at scale on Azure or AWS
  • End-to-end eval pipeline (scripts/run_eval_pipeline.py) -- orchestrates demo generation, VM lifecycle, SSH tunnels, and ZS/DC evaluation in a single command
  • RL training environment -- RLEnvironment wrapper provides a Gymnasium-style reset/step/evaluate interface for online RL (GRPO, PPO) with outcome-based rewards from WAA scores
  • Annotation pipeline -- VLM-based screenshot annotation (annotation.py, vlm.py) migrated from openadapt-ml so the full record-annotate-evaluate workflow runs within this repo
  • 4-layer WAA probe -- probe --detailed checks screenshot capture, accessibility tree, action pipeline, and scoring independently; supports --json and --layers filtering
  • Demo recording and review -- VNC-based demo capture with auto-persistence (incremental meta.json, hardlinked PNGs), JPEG thumbnail deduplication, and markdown review artifact generation
  • CLI tools -- oa-vm for VM and pool management (50+ commands), benchmark CLI for running evals
  • Cost optimization -- tiered VM sizing, spot instance support, and real-time cost tracking
  • Results visualization -- HTML viewer with step-by-step screenshot replay, execution logs, and domain breakdowns
  • Trace export for converting evaluation trajectories into training data
  • Configuration via pydantic-settings with automatic .env loading

Installation

pip install openadapt-evals

With optional dependencies:

pip install openadapt-evals[azure]      # Azure VM management
pip install openadapt-evals[aws]        # AWS EC2 management
pip install openadapt-evals[retrieval]  # Demo retrieval agent
pip install openadapt-evals[viewer]     # Live results viewer
pip install openadapt-evals[all]        # Everything

Quick Start

Run a mock evaluation (no VM required)

openadapt-evals mock --tasks 10

Run a live evaluation against a WAA server

# Start with a single VM (Azure by default)
oa-vm pool-create --workers 1
oa-vm pool-wait

# Or use AWS
oa-vm pool-create --cloud aws --workers 1
oa-vm pool-wait --cloud aws

# Run evaluation
openadapt-evals run --agent api-claude --task notepad_1

# View results
openadapt-evals view --run-name live_eval

# Clean up (stop billing)
oa-vm pool-cleanup -y

Python API

from openadapt_evals import (
    ApiAgent,
    WAALiveAdapter,
    WAALiveConfig,
    evaluate_agent_on_benchmark,
    compute_metrics,
)

adapter = WAALiveAdapter(WAALiveConfig(server_url="http://localhost:5001"))
agent = ApiAgent(provider="anthropic")

results = evaluate_agent_on_benchmark(agent, adapter, task_ids=["notepad_1"])
metrics = compute_metrics(results)
print(f"Success rate: {metrics['success_rate']:.1%}")

Demo-conditioned evaluation

Record demos on a remote VM via VNC, annotate with a VLM, then run demo-conditioned eval:

# 1. Pre-flight check: verify all required apps are installed
python scripts/record_waa_demos.py record-waa \
  --tasks 04d9aeaf,0a0faba3 \
  --server http://localhost:5001 \
  --verify

# 2. Record demos interactively (perform actions on VNC, press Enter after each step)
python scripts/record_waa_demos.py record-waa \
  --tasks 04d9aeaf,0a0faba3 \
  --server http://localhost:5001 \
  --output waa_recordings/

# 3. Annotate recordings with VLM
python scripts/record_waa_demos.py annotate \
  --recordings waa_recordings/ \
  --output annotated_demos/ \
  --provider openai

# 4. Run demo-conditioned eval
python scripts/record_waa_demos.py eval \
  --demo_dir annotated_demos/ \
  --tasks 04d9aeaf,0a0faba3

End-to-end eval pipeline

For a fully automated flow (demo generation, VM lifecycle, SSH tunnels, ZS and DC evaluation):

# Run for all recordings that have demos
python scripts/run_eval_pipeline.py

# Specific task(s)
python scripts/run_eval_pipeline.py --tasks 04d9aeaf

# Dry run
python scripts/run_eval_pipeline.py --tasks 04d9aeaf --dry-run

# AWS instead of Azure
python scripts/run_eval_pipeline.py --cloud aws --vm-name waa-pool-00

Parallel evaluation

# Create a pool of VMs and distribute tasks (Azure)
oa-vm pool-create --workers 5
oa-vm pool-wait
oa-vm pool-run --tasks 50

# Same workflow on AWS
oa-vm pool-create --cloud aws --workers 5
oa-vm pool-wait --cloud aws
oa-vm pool-run --cloud aws --tasks 50

# Or use Azure ML orchestration
openadapt-evals azure --workers 10 --waa-path /path/to/WindowsAgentArena

Architecture

openadapt_evals/
├── agents/               # Agent implementations
│   ├── base.py           #   BenchmarkAgent ABC
│   ├── api_agent.py      #   ApiAgent (Claude, GPT)
│   ├── claude_computer_use_agent.py  # ClaudeComputerUseAgent (coord clamping, fail-safe)
│   ├── retrieval_agent.py#   RetrievalAugmentedAgent
│   └── policy_agent.py   #   PolicyAgent (trained models)
├── adapters/             # Benchmark adapters
│   ├── base.py           #   BenchmarkAdapter ABC + data classes
│   ├── rl_env.py         #   RLEnvironment (Gymnasium-style wrapper for GRPO/PPO)
│   └── waa/              #   WAA live, mock, and local adapters
├── infrastructure/       # Cloud VM and pool management
│   ├── azure_vm.py       #   AzureVMManager
│   ├── aws_vm.py         #   AWSVMManager
│   ├── vm_provider.py    #   VMProvider protocol (multi-cloud abstraction)
│   ├── pool.py           #   PoolManager
│   ├── probe.py          #   4-layer WAA probe (screenshot, a11y, action, score)
│   ├── ssh_tunnel.py     #   SSHTunnelManager
│   └── vm_monitor.py     #   VMMonitor dashboard
├── evaluation/           # Shared evaluation utilities
│   └── metrics.py        #   fuzzy_match and scoring functions
├── benchmarks/           # Evaluation runner, CLI, viewers
│   ├── runner.py         #   evaluate_agent_on_benchmark()
│   ├── cli.py            #   Benchmark CLI (run, mock, live, view, probe)
│   ├── vm_cli.py         #   VM/Pool CLI (oa-vm, 50+ commands)
│   ├── viewer.py         #   HTML results viewer
│   ├── pool_viewer.py    #   Pool results viewer
│   └── trace_export.py   #   Training data export
├── waa_deploy/           # WAA Docker image & task setup
│   ├── evaluate_server.py#   Flask server (port 5050): /setup, /evaluate, /task
│   ├── Dockerfile        #   QEMU + Windows 11 + pre-downloaded apps
│   └── tools_config.json #   App installer URLs and configs
├── annotation.py         # VLM-based demo annotation pipeline
├── vlm.py                # VLM provider abstraction (OpenAI, Anthropic)
├── server/               # WAA server extensions
├── config.py             # Settings (pydantic-settings, .env)
└── __init__.py
scripts/
├── run_eval_pipeline.py      # End-to-end eval: demo gen + VM + ZS/DC eval
├── record_waa_demos.py       # Record demos via VNC
├── generate_demo_review.py   # Markdown review artifacts with thumbnails
├── run_grpo_rollout.py       # Example: collect RL rollouts from WAA
├── refine_demo.py            # Two-pass LLM demo refinement
└── run_dc_eval.py            # Demo-conditioned evaluation

How it fits together

LOCAL MACHINE                          CLOUD VM (Azure or AWS, Ubuntu)
┌─────────────────────┐                ┌──────────────────────────────┐
│  oa-vm CLI          │   SSH Tunnel   │  Docker                      │
│  (pool management)  │ ─────────────> │  ├─ evaluate_server (:5050)  │
│                     │  :5001 → :5000 │  │  └─ /setup, /evaluate     │
│  openadapt-evals    │  :5051 → :5050 │  ├─ Samba share (/tmp/smb/)  │
│  (benchmark runner) │  :8006 → :8006 │  └─ QEMU (Win 11)           │
│                     │                │     ├─ WAA Flask API (:5000) │
│                     │                │     ├─ \\host.lan\Data\      │
│                     │                │     └─ Agent                 │
└─────────────────────┘                └──────────────────────────────┘

Both backends use the same VMProvider protocol. Pass --cloud azure (default) or --cloud aws to any pool command. AWS requires m5.metal instances ($4.61/hr) for KVM/QEMU nested virtualization; Azure uses Standard_D8ds_v5 ($0.38/hr).

Windows 11 on AWS EC2

WAA Task Setup & App Management

The evaluate server (waa_deploy/evaluate_server.py) runs on the Docker Linux side (port 5050) and orchestrates task setup on the Windows VM. The /setup endpoint accepts a list of setup handlers:

# Check if required apps are installed on the Windows VM
curl -X POST http://localhost:5051/setup \
  -H "Content-Type: application/json" \
  -d '{"config": [{"type": "verify_apps", "parameters": {"apps": ["libreoffice-calc"]}}]}'
# → 200 if all present, 422 if any missing

# Install missing apps via two-phase pipeline
curl -X POST http://localhost:5051/setup \
  -H "Content-Type: application/json" \
  -d '{"config": [{"type": "install_apps", "parameters": {"apps": ["libreoffice-calc"]}}]}'

Two-phase install pipeline

Large installers (e.g. LibreOffice 350MB MSI) can't be downloaded within the WAA server's 120s command timeout. The install_apps handler solves this with a two-phase approach:

  1. Download on Linux -- the evaluate server downloads the installer to the Samba share (/tmp/smb/ = \\host.lan\Data\ on Windows), with no timeout constraint
  2. Install on Windows -- a small PowerShell script is written to the Samba share and executed via the WAA server, running only msiexec (fast, no download)

The Dockerfile also pre-downloads LibreOffice at build time with dynamic version discovery, so first-boot installs work without depending on mirror availability.

Automatic app verification

When a task config includes related_apps, the live adapter automatically prepends a verify_apps step before the task's setup config. The --verify flag on record_waa_demos.py provides a pre-flight check across all tasks before starting a recording session.

LibreOffice Calc running inside Windows 11 QEMU VM via noVNC in Chrome

CLI Reference

Benchmark CLI (openadapt-evals)

Command Description
run Run live evaluation (localhost:5001 default)
mock Run with mock adapter (no VM required)
live Run against a WAA server (full control)
eval-suite Automated full-cycle evaluation (ZS + DC)
azure Run parallel evaluation on Azure ML
probe Check WAA readiness (--detailed for 4-layer diagnostics, --json, --layers)
view Generate HTML viewer for results
estimate Estimate Azure costs

VM/Pool CLI (oa-vm)

Command Description
pool-create Create N VMs with Docker and WAA
pool-wait Wait until WAA is ready on all workers
pool-run Distribute tasks across pool workers
pool-status Show status of all pool VMs
pool-pause Deallocate pool VMs (stop billing)
pool-resume Restart deallocated pool VMs
pool-cleanup Delete all pool VMs and resources
image-create Create golden image from a pool VM
image-list List available golden images
vm monitor Dashboard with SSH tunnels
vm setup-waa Deploy WAA container on a VM
smoke-test-aws Verify AWS credentials, AMI, VPC, lifecycle

All pool commands accept --cloud azure (default) or --cloud aws.

Run oa-vm --help for the full list of 50+ commands.

Configuration

Settings are loaded automatically from environment variables or a .env file in the project root via pydantic-settings.

# .env
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...

# Azure (for --cloud azure VM management)
AZURE_SUBSCRIPTION_ID=...
AZURE_ML_RESOURCE_GROUP=...
AZURE_ML_WORKSPACE_NAME=...

AWS authentication

AWS credentials are resolved via boto3's default credential chain. SSO (IAM Identity Center) is recommended for interactive use:

# One-time setup — opens a guided wizard
aws configure sso
# Prompts for: SSO start URL, region, account, role name, profile name

# Login (opens browser, caches short-lived token)
aws sso login

# Verify it works
oa-vm smoke-test-aws

# All oa-vm --cloud aws commands now work automatically
oa-vm pool-create --cloud aws --workers 1
Example ~/.aws/config for SSO
[default]
sso_session = my-org
sso_account_id = 111122223333
sso_role_name = PowerUserAccess
region = us-east-1

[sso-session my-org]
sso_start_url = https://my-org.awsapps.com/start
sso_region = us-east-1
sso_registration_scopes = sso:account:access

Static keys (AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY in .env) also work but are not recommended for interactive use -- they don't expire and are a security risk if leaked.

See openadapt_evals/config.py for all available settings.

Custom Agents

Implement the BenchmarkAgent interface to evaluate your own agent:

from openadapt_evals import BenchmarkAgent, BenchmarkAction, BenchmarkObservation, BenchmarkTask

class MyAgent(BenchmarkAgent):
    def act(
        self,
        observation: BenchmarkObservation,
        task: BenchmarkTask,
        history: list[tuple[BenchmarkObservation, BenchmarkAction]] | None = None,
    ) -> BenchmarkAction:
        # Your agent logic here
        return BenchmarkAction(type="click", x=0.5, y=0.5)

    def reset(self) -> None:
        pass

Contributing

We welcome contributions. To get started:

git clone https://github.com/OpenAdaptAI/openadapt-evals.git
cd openadapt-evals
uv sync --extra dev
uv run pytest tests/ -v

See CLAUDE.md for development conventions and architecture details.

Project Description
OpenAdapt Desktop automation with demo-conditioned AI agents
openadapt-ml Training and policy runtime
openadapt-capture Screen recording and demo sharing
openadapt-consilium Multi-model consensus library
openadapt-grounding UI element localization

License

MIT


View on GitHub | Report an issue