openadapt-evals¶
Benchmark evaluation infrastructure for GUI automation agents.
Repository: OpenAdaptAI/openadapt-evals
Installation¶
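Assuming the package is published on PyPI under its repository name (an assumption; see the repository for the authoritative instructions):

# Package name assumed to match the repository
pip install openadapt-evals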
Overview¶
The evals package provides:
- Benchmark adapters for standardized evaluation
- API agent implementations (Claude, GPT-4V)
- Evaluation runners and metrics
- Mock environments for testing
CLI Commands¶
Run Evaluation¶
# Evaluate a trained policy
openadapt eval run --checkpoint training_output/model.pt --benchmark waa
# Evaluate an API agent
openadapt eval run --agent api-claude --benchmark waa
Options:
- --checkpoint - Path to trained policy checkpoint
- --agent - Agent type (api-claude, api-gpt4v, custom)
- --benchmark - Benchmark name (waa, osworld, etc.)
- --tasks - Number of tasks to evaluate (default: all)
- --output - Output directory for results
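For example, to evaluate an API agent on a 25-task subset and write results to a custom directory (the task count and output path are placeholders):

# Evaluate Claude on 25 WAA tasks, writing results to a custom directory
openadapt eval run --agent api-claude --benchmark waa --tasks 25 --output results/claude-waa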
Run Mock Evaluation¶
Test your setup without running actual benchmarks:
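One way to do this, assuming an API agent is configured, is to target the mock benchmark listed below:

# No real benchmark environment is required for the mock benchmark
openadapt eval run --agent api-claude --benchmark mock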
List Available Benchmarks¶
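The exact subcommand is an assumption here; consult openadapt eval --help if it differs:

# List registered benchmarks (subcommand name assumed)
openadapt eval list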
Supported Benchmarks¶
| Benchmark | Description | Tasks |
|---|---|---|
| waa | Windows Agent Arena | 154 |
| osworld | OSWorld | 369 |
| webarena | WebArena | 812 |
| mock | Mock benchmark for testing | Configurable |
API Agents¶
Claude Agent¶
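The Claude agent calls Anthropic's API, so credentials must be available. The environment variable name below is an assumption about how the key is supplied:

# Environment variable name assumed; see the repository for configuration details
export ANTHROPIC_API_KEY=...
openadapt eval run --agent api-claude --benchmark waa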
GPT-4V Agent¶
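The GPT-4V agent is analogous, assuming an OpenAI key is read from the environment (again an assumption about configuration):

# Environment variable name assumed
export OPENAI_API_KEY=...
openadapt eval run --agent api-gpt4v --benchmark waa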
Python API¶
from openadapt_evals import ApiAgent, BenchmarkAdapter, evaluate_agent_on_benchmark
# Create an API agent
agent = ApiAgent.claude()
# Or load a trained policy
from openadapt_ml import AgentPolicy
agent = AgentPolicy.from_checkpoint("model.pt")
# Run evaluation
results = evaluate_agent_on_benchmark(
agent=agent,
benchmark="waa",
num_tasks=10
)
print(f"Success rate: {results.success_rate:.2%}")
print(f"Average steps: {results.avg_steps:.1f}")
Evaluation Loop¶
flowchart TB
subgraph Agent["Agent Under Test"]
POLICY[Agent Policy]
API[API Agent]
end
subgraph Benchmark["Benchmark System"]
ADAPTER[Benchmark Adapter]
MOCK[Mock Adapter]
LIVE[Live Adapter]
end
subgraph Tasks["Task Execution"]
TASK[Get Task]
OBS[Observe State]
ACT[Execute Action]
CHECK[Check Success]
end
subgraph Metrics["Metrics"]
SUCCESS[Success Rate]
STEPS[Avg Steps]
TIME[Execution Time]
end
POLICY --> ADAPTER
API --> ADAPTER
ADAPTER --> MOCK
ADAPTER --> LIVE
MOCK --> TASK
LIVE --> TASK
TASK --> OBS
OBS --> POLICY
OBS --> API
POLICY --> ACT
API --> ACT
ACT --> CHECK
CHECK -->|next| TASK
CHECK -->|done| SUCCESS
CHECK --> STEPS
    CHECK --> TIME
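In code, the loop the diagram describes is: fetch a task, alternate observation and action until the adapter reports success or a step budget runs out, then aggregate metrics. The sketch below is illustrative pseudocode of that control flow, not the package's internal implementation; the adapter and agent method names are assumptions:

# Illustrative control flow only; method names are assumptions
def run_task(agent, adapter, task, max_steps=30):
    obs = adapter.reset(task)            # Get Task / initial Observe State
    for step in range(1, max_steps + 1):
        action = agent.act(obs)          # Agent chooses the next GUI action
        obs = adapter.step(action)       # Execute Action, observe new state
        if adapter.is_success(task):     # Check Success
            return {"success": True, "steps": step}
    return {"success": False, "steps": max_steps}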
Key Exports¶
| Export | Description |
|---|---|
| ApiAgent | API-based agent (Claude, GPT-4V) |
| BenchmarkAdapter | Benchmark interface |
| MockAdapter | Mock benchmark for testing |
| evaluate_agent_on_benchmark | Agent evaluation function |
| EvalResults | Evaluation results container |
Metrics¶
| Metric | Description |
|---|---|
| Success Rate | Percentage of tasks completed successfully |
| Average Steps | Mean number of steps per task |
| Execution Time | Total and per-task timing |
| Error Rate | Percentage of tasks that errored |
Related Packages¶
- openadapt-ml - Learn policies to evaluate
- openadapt-capture - Collect demonstrations