OpenAdapt Architecture Evolution¶
Version: 3.0 | Date: January 2026 | Status: Living Document
Executive Summary¶
This document traces the evolution of OpenAdapt from its original alpha vision through the modern modular implementation, synthesizing state-of-the-art GUI agent research into a unified framework. OpenAdapt's core innovation is demonstration-conditioned automation: "show, don't tell."
Table of Contents¶
- Original Alpha Vision
- The Abstraction Ladder
- Core Innovation: Demo-Conditioned Agents
- Modern Architecture
- SOTA GUI Agent Integration
- Package Responsibilities
- Feedback Loops
- Implementation Status
- Architecture Evolution Diagrams
- Future Directions
1. Original Alpha Vision¶
The Three-Stage Pipeline (2023)¶
OpenAdapt was conceived as a three-stage pipeline for AI-first process automation:
+=====================+ +=====================+ +=====================+
| | | | | |
| RECORDING | --> | ANALYSIS | --> | REPLAY |
| | | | | |
| Capture human | | Convert to | | Generate and |
| demonstrations: | | tokenized format | | replay synthetic |
| - Screenshots | | for LMM | | input via model |
| - User input | | processing | | completions |
| | | | | |
+=====================+ +=====================+ +=====================+
Original Design Goals¶
From the legacy README:
"The goal is similar to that of Robotic Process Automation (RPA), except that we use Large Multimodal Models instead of conventional RPA tools."
Key Differentiators (Alpha):

1. Model Agnostic - Works with any LMM
2. Auto-Prompted - Learns from demonstration, not user prompts
3. Grounded in Existing Processes - Mitigates hallucinations
4. Universal GUI Support - Desktop, web, and virtualized (Citrix)
5. Open Source - MIT license
Legacy Monolithic Implementation¶
The alpha codebase (legacy/openadapt/) implemented:
openadapt/
record.py # Screenshot/event capture
replay.py # Strategy-based playback
models.py # Recording, ActionEvent, Screenshot, WindowEvent
events.py # Event aggregation/processing
strategies/
base.py # BaseReplayStrategy abstract class
naive.py # Direct literal replay
stateful.py # GPT-4 + OS-level window data
vanilla.py # Full VLM reasoning per step
visual.py # FastSAM segmentation
visual_browser.py # DOM-based segments
adapters/
anthropic.py # Claude API integration
openai.py # GPT API integration
replicate.py # Open-source model hosting
privacy/
base.py # Scrubbing provider interface
providers/ # Presidio, AWS Comprehend, Private AI
The Strategy Pattern (Original)¶
The original architecture used a BaseReplayStrategy abstract class:
from abc import ABC, abstractmethod


class BaseReplayStrategy(ABC):
    """Base class for implementing replay strategies."""

    def __init__(self, recording: Recording) -> None:
        self.recording = recording
        self.action_events = []
        self.screenshots = []
        self.window_events = []

    @abstractmethod
    def get_next_action_event(
        self,
        screenshot: Screenshot,
        window_event: WindowEvent,
    ) -> ActionEvent:
        """Get the next action based on current observation."""
        pass

    def run(self) -> None:
        """Execute the replay loop."""
        while True:
            screenshot = Screenshot.take_screenshot()
            window_event = WindowEvent.get_active_window_event()
            action_event = self.get_next_action_event(screenshot, window_event)
            if action_event:
                playback.play_action_event(action_event, ...)
This pattern evolved into the modern policy/grounding separation.
Alpha Data Model¶
class Recording:
    """Container for a demonstration session."""
    id: int
    timestamp: float
    task_description: str
    action_events: list[ActionEvent]
    screenshots: list[Screenshot]
    window_events: list[WindowEvent]


class ActionEvent:
    """A single user action (click, type, scroll, etc.)."""
    name: str                   # "click", "type", "scroll", "press", "release"
    timestamp: float
    screenshot: Screenshot      # Screenshot just before action
    window_event: WindowEvent   # Active window state
    mouse_x: int                # Mouse coordinates
    mouse_y: int
    key_char: str               # Keyboard input
    key_name: str
    element_state: dict         # Accessibility info


class Screenshot:
    """A captured screen image."""
    timestamp: float
    png_data: bytes
    image: PIL.Image
2. The Abstraction Ladder¶
Core Concept: Progressive Abstraction¶
OpenAdapt processes demonstrations through ascending levels of abstraction, enabling generalization and transfer learning.
+=========================================================================+
| |
| Level 4: GOAL (Task Specification) FUTURE |
| "Say hello to the customer" |
| |
| ^ |
| | Goal Composition (high-level planning) |
| | |
+=========================================================================+
| |
| Level 3: SEMANTIC (Intent Recognition) FUTURE |
| { action: "greet", target: "user" } |
| |
| ^ |
| | Process Mining (discover patterns) |
| | |
+=========================================================================+
| |
| Level 2: TEMPLATE (Parameterized Actions) PARTIAL |
| { type: "hi <firstname>" } |
| |
| ^ |
| | Anonymization (extract parameters) |
| | |
+=========================================================================+
| |
| Level 1: SYMBOLIC (Semantic Actions) IMPLEMENTED |
| { type: "hi bob" } |
| |
| ^ |
| | Reduction (aggregate consecutive events) |
| | |
+=========================================================================+
| |
| Level 0: LITERAL (Raw Events) IMPLEMENTED |
| { press: "h" }, { press: "i" }, { press: " " }, { press: "b" }, ... |
| |
+=========================================================================+
Abstraction Level Details¶
| Level | Name | Representation | Transformation | Status |
|---|---|---|---|---|
| 0 | Literal | Raw keypresses, mouse coords | None (raw capture) | Implemented |
| 1 | Symbolic | Aggregated actions (type "hello") | Event reduction | Implemented |
| 2 | Template | Parameterized (type "<greeting>") | Regex extraction | Partial |
| 3 | Semantic | Intent-level (greet user) | LLM intent recognition | Research |
| 4 | Goal | Task description ("Welcome customer") | Goal composition | Future |
Why Abstraction Matters¶
| Level | Enables | Example Use Case |
|---|---|---|
| Literal | Exact replay, debugging | Audit trails, regression tests |
| Symbolic | Human-readable logs | Training data visualization |
| Template | Parameterized replay | Same task, different data |
| Semantic | Cross-application transfer | Greeting in any messaging app |
| Goal | Natural language control | "Greet the next customer" |
Current Implementation¶
Literal to Symbolic (openadapt-capture):

- Event aggregation in events.py
- Consecutive keypresses become type actions
- Mouse drags become drag actions
- Click sequences become doubleclick or tripleclick
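A minimal sketch of this aggregation, assuming a simplified event shape; the real logic in openadapt-capture's events.py also handles timing, modifiers, and release events:

```python
# Hypothetical sketch of literal -> symbolic aggregation (simplified event shape).
from dataclasses import dataclass


@dataclass
class RawEvent:
    name: str        # e.g. "press", "click"
    key_char: str    # single character for key events, "" otherwise
    timestamp: float


def aggregate_keypresses(events):
    """Merge consecutive press events into single symbolic type actions."""
    actions, buffer = [], []
    for event in events:
        if event.name == "press" and event.key_char:
            buffer.append(event.key_char)
            continue
        if buffer:                                    # flush pending keystrokes
            actions.append({"type": "".join(buffer)})
            buffer = []
        actions.append({event.name: event.key_char})
    if buffer:
        actions.append({"type": "".join(buffer)})
    return actions


# {press: "h"}, {press: "i"}, {press: " "}, {press: "b"}, ... collapses to {type: "hi bob"}
```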
Symbolic to Template (Partial):

- Regex-based parameter extraction
- User-defined placeholders
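A minimal sketch of regex-based extraction, assuming the `<placeholder>` format from the abstraction ladder example; the actual parameterization in openadapt-capture is still partial:

```python
# Hypothetical sketch of symbolic -> template extraction via regex substitution.
import re


def extract_template(typed_text, known_values):
    """Replace known parameter values with named placeholders.

    extract_template("hi bob", {"firstname": "bob"}) -> "hi <firstname>"
    """
    template = typed_text
    for name, value in known_values.items():
        template = re.sub(re.escape(value), f"<{name}>", template)
    return template
```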
Template to Semantic (Research):

- LLM-based intent recognition
- Pattern library discovery
Semantic to Goal (Future):

- Process mining algorithms
- Cross-demo pattern extraction
3. Core Innovation: Demo-Conditioned Agents¶
The Fundamental Differentiator¶
OpenAdapt's core insight is demonstration-conditioned automation: "show, don't tell."
+-------------------------------------------------------------------+
| TRADITIONAL APPROACH |
+-------------------------------------------------------------------+
| |
| User: "Click the submit button" |
| |
| Agent: [Which submit button? What context? What state?] |
| [Multiple submit buttons on page?] |
| [Different applications have different buttons] |
| |
| Result: AMBIGUOUS -> Requires prompt engineering |
| |
+-------------------------------------------------------------------+
+-------------------------------------------------------------------+
| DEMO-CONDITIONED APPROACH |
+-------------------------------------------------------------------+
| |
| User: [Records clicking the blue "Submit Order" button |
| after filling out form fields] |
| |
| Agent: [Learns full context: |
| - Form state before action |
| - Button appearance and location |
| - Preceding actions in sequence |
| - Window/application context] |
| |
| Result: GROUNDED -> No prompt engineering needed |
| |
+-------------------------------------------------------------------+
Why Demo-Conditioning Works¶
- Captures Implicit Knowledge: Users demonstrate things they can't easily verbalize
- Grounded in Reality: Actions tied to actual UI states, not abstract descriptions
- Reduces Ambiguity: Visual context eliminates interpretation errors
- Lower Barrier: No prompt engineering skills required
Empirical Results¶
Demo conditioning improves first-action accuracy:
| Approach | First-Action Accuracy | Notes |
|---|---|---|
| Prompt-only | ~33% | Ambiguity in action selection |
| Demo-conditioned | ~100% | Full context from demonstration |
The "Show, Don't Tell" Principle¶
# Traditional: Prompt-driven
agent.execute("Click the submit button")
# -> Which submit button? What state? What context?
# Demo-Conditioned: Demonstration-driven
demo = capture_demonstration() # User clicks specific submit button
agent = train_policy(demo) # Agent learns the full context
agent.execute(new_context) # Agent adapts to variations
4. Modern Architecture¶
Evolution: Monolith to Meta-Package¶
ALPHA (2023-2024) MODERN (2025+)
+====================+ +====================+
| | | openadapt |
| openadapt | | (meta-pkg) |
| (monolithic) | +=========+=========+
| | |
| - record.py | +-----------------+-----------------+
| - replay.py | | | | | |
| - strategies/ | +----+----+ +--+--+ +--+--+ +--+--+ +----+----+
| - models.py | |capture | | ml | |evals| |viewer| |optional |
| - adapters/ | +---------+ +-----+ +-----+ +------+ +---------+
| - privacy/ |
| - visualize.py | + grounding, retrieval, privacy
| |
+====================+
The Modern Three-Phase Architecture¶
Building on the alpha vision, the modern architecture formalizes three phases:
+=======================+ +=======================+ +=======================+
|| || || || || ||
|| DEMONSTRATE || --> || LEARN || --> || EXECUTE ||
|| || || || || ||
|| (Observation || || (Policy || || (Agent ||
|| Collection) || || Acquisition) || || Deployment) ||
|| || || || || ||
|| Packages: || || Packages: || || Packages: ||
|| - capture || || - ml || || - evals ||
|| - privacy || || - retrieval || || - grounding ||
|| || || || || ||
+=======================+ +=======================+ +=======================+
Phase 1: DEMONSTRATE (Observation Collection)¶
Purpose: Capture rich trajectories from human demonstrations.
Inputs:

- User performs task in target application
- Optional: Task description, success criteria, audio narration
Outputs:

- Screenshot sequences (PNG/JPEG)
- Input events (mouse, keyboard, touch)
- Accessibility tree snapshots (a11y)
- Window metadata (title, bounds, process)
- Audio transcription (optional)
Packages: openadapt-capture, openadapt-privacy
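An illustrative usage sketch: Recorder and CaptureSession are the exports listed under Package Responsibilities, but the import path and method names used here (start, stop, save) are assumptions, not the confirmed API.

```python
# Illustrative only: method names and import path are assumptions.
from openadapt_capture import Recorder  # assumed import path

recorder = Recorder(task_description="Submit a customer order")
recorder.start()                        # user performs the task in the target app
session = recorder.stop()               # assumed to return a CaptureSession / Trajectory
session.save("demos/submit_order")      # screenshots, input events, a11y snapshots
```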
Phase 2: LEARN (Policy Acquisition)¶
Purpose: Transform demonstrations into executable agent policies.
Three Learning Paths:
| Path | Mechanism | Advantage | Package |
|---|---|---|---|
| A: Retrieval-Augmented | Index demos, retrieve similar | No training needed | openadapt-retrieval |
| B: Fine-Tuning | Train VLM on demo dataset | Specialized performance | openadapt-ml |
| C: Process Mining | Extract reusable patterns | Cross-task transfer | openadapt-ml (future) |
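A sketch of Path A (retrieval-augmented conditioning): DemoIndex and Embedder are listed exports of openadapt-retrieval, while the constructor arguments and the add/search methods shown are assumptions.

```python
# Path A sketch: embed and index demos offline, retrieve similar ones at run time.
from openadapt_retrieval import DemoIndex, Embedder  # assumed import path


def build_index(demos):
    """Embed and index recorded demonstrations."""
    index = DemoIndex(embedder=Embedder(model="clip"))  # CLIP/SigLIP per status table
    for demo in demos:
        index.add(demo)                                  # assumed method name
    return index


def retrieve_context(index, screenshot, k=3):
    """Return the k most similar demos to condition the policy on."""
    return [result.demo for result in index.search(screenshot, k=k)]  # assumed API
```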
Phase 3: EXECUTE (Agent Deployment)¶
Purpose: Run trained/conditioned agents autonomously.
Execution Loop:
while not task_complete:
1. OBSERVE - Capture screenshot + a11y tree
2. GROUND - Localize UI elements (SoM, OmniParser)
3. PLAN - VLM reasoning with demo context
4. ACT - Execute via input synthesis
5. EVALUATE - Check success, decide next step
Packages: openadapt-evals, openadapt-grounding, openadapt-ml
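A hypothetical glue-code sketch of this loop; the component interfaces (grounder.detect, policy.plan, and so on) are illustrative stand-ins rather than the packages' confirmed APIs.

```python
# Hypothetical glue code for the OBSERVE -> GROUND -> PLAN -> ACT -> EVALUATE loop.
def run_episode(observe, grounder, policy, executor, evaluator, max_steps=50):
    """Run one demo-conditioned episode; returns True on success."""
    for _ in range(max_steps):
        screenshot, a11y_tree = observe()                      # 1. OBSERVE
        elements = grounder.detect(screenshot)                 # 2. GROUND (SoM / OmniParser)
        action = policy.plan(screenshot, a11y_tree, elements)  # 3. PLAN (VLM + demo context)
        executor.act(action)                                   # 4. ACT (input synthesis)
        verdict = evaluator.check(screenshot)                  # 5. EVALUATE
        if verdict in ("success", "failure"):
            return verdict == "success"
    return False
```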
5. SOTA GUI Agent Integration¶
Policy/Grounding Separation¶
From Claude Computer Use, UFO, and SeeAct research:
+====================+ +====================+
| | | |
| POLICY | --> | GROUNDING |
| | | |
| "What to do" | | "Where to do" |
| | | |
| - Observation | | - Element |
| encoding | | detection |
| - Action | | - Coordinate |
| selection | | mapping |
| - History | | - Bounding |
| context | | boxes |
| | | |
+====================+ +====================+
OpenAdapt Implementation:

- Policy: openadapt-ml adapters (Claude, GPT-4V, Qwen-VL)
- Grounding: openadapt-grounding providers (OmniParser, Florence2, Gemini)
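An illustrative pairing of the two roles: ElementDetector and ClaudeAdapter appear in the package export tables, but the import paths, constructors, and methods shown here are assumptions.

```python
# Illustrative only: "where" comes from grounding, "what" from the policy adapter.
from openadapt_grounding import ElementDetector  # assumed import path
from openadapt_ml import ClaudeAdapter           # assumed import path


def decide(screenshot, demo_context):
    """Grounding localizes elements; the policy selects the next action."""
    elements = ElementDetector(provider="omniparser").detect(screenshot)      # where
    return ClaudeAdapter().select_action(screenshot, elements, demo_context)  # what
```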
Set-of-Mark (SoM) Prompting¶
From Microsoft's Set-of-Mark paper:
Original Screenshot SoM-Annotated Screenshot
+---------------------+ +---------------------+
| [Login] [Help] | | [1] [2] |
| | -> | |
| Email: [________] | | Email: [3] |
| Pass: [________] | | Pass: [4] |
| [Submit] | | [5] |
+---------------------+ +---------------------+
Prompt: "Enter email in element [3], password in [4], click [5]"
OpenAdapt Implementation: openadapt-grounding.SoMPrompt
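The annotation step itself can be sketched with Pillow, given detected bounding boxes; this mirrors the SoM idea rather than the exact SoMPrompt implementation.

```python
# Minimal Set-of-Mark sketch: overlay numbered labels on detected bounding boxes
# so the VLM can refer to elements by index.
from PIL import Image, ImageDraw


def annotate_som(image: Image.Image, boxes: list) -> Image.Image:
    """Draw a numbered label at the top-left corner of each (l, t, r, b) box."""
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    for i, (left, top, right, bottom) in enumerate(boxes, start=1):
        draw.rectangle((left, top, right, bottom), outline="red", width=2)
        draw.text((left + 2, top + 2), f"[{i}]", fill="red")
    return annotated
```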
Safety Gates¶
From responsible AI patterns:
+------------------+ +------------------+ +------------------+
| | | | | |
| OBSERVE | --> | VALIDATE | --> | ACT |
| | | | | |
| Get current | | - Check bounds | | Execute if |
| state | | - Verify perms | | validated |
| | | - Rate limit | | |
+------------------+ +--------+---------+ +------------------+
|
v (rejected)
+------------------+
| ESCALATE |
| Human review |
+------------------+
Status: Planned in openadapt-evals safety module.
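Because the safety module is still planned, the following is a sketch only, with every name illustrative rather than part of openadapt-evals today.

```python
# Sketch of a validate-before-act gate with human escalation on rejection.
from dataclasses import dataclass


@dataclass
class ProposedAction:
    name: str   # e.g. "click", "type"
    x: int = 0
    y: int = 0


def validate(action: ProposedAction, screen_w: int, screen_h: int, allowed: set) -> bool:
    """Reject actions outside the allow-list or the screen bounds."""
    in_bounds = 0 <= action.x < screen_w and 0 <= action.y < screen_h
    return action.name in allowed and in_bounds


def gated_execute(action: ProposedAction, execute, escalate) -> None:
    """ACT only when VALIDATE passes; otherwise ESCALATE for human review."""
    if validate(action, 1920, 1080, {"click", "type", "scroll"}):
        execute(action)
    else:
        escalate(action)
```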
Research Alignment¶
| Research Paper | Key Contribution | OpenAdapt Integration |
|---|---|---|
| Claude Computer Use (Anthropic, 2024) | Production VLM agent API | API adapter in openadapt-ml |
| UFO (Microsoft, 2024) | Windows agent architecture | Prompt patterns adopted |
| OSWorld (HKU, 2024) | Cross-platform benchmark | Benchmark adapter planned |
| Set-of-Mark (Microsoft, 2023) | Visual grounding via labels | Core grounding mode |
| OmniParser (Microsoft, 2024) | Pure-vision UI parsing | Provider in openadapt-grounding |
| SeeAct (OSU, 2024) | Grounded action generation | Action space design |
| WebArena (CMU, 2023) | Web automation benchmark | Benchmark adapter implemented |
| AppAgent (Tencent, 2024) | Mobile GUI agent | Mobile support planned |
6. Package Responsibilities¶
Package-to-Phase Mapping¶
+===============================================================================+
| DEMONSTRATE PHASE |
+===============================================================================+
| Package | Responsibility | Key Exports |
+-------------------+----------------------------+------------------------------+
| openadapt-capture | GUI recording, storage | Recorder, CaptureSession |
| | | Action, Screenshot, Trajectory|
+-------------------+----------------------------+------------------------------+
| openadapt-privacy | PII/PHI scrubbing | Scrubber, Redactor |
| | (integrates at capture) | PrivacyFilter |
+===============================================================================+
+===============================================================================+
| LEARN PHASE |
+===============================================================================+
| Package | Responsibility | Key Exports |
+---------------------+--------------------------+------------------------------+
| openadapt-ml | Model training, | Trainer, AgentPolicy |
| | inference, adapters | QwenVLAdapter, ClaudeAdapter |
+---------------------+--------------------------+------------------------------+
| openadapt-retrieval | Demo embedding, | DemoIndex, Embedder |
| | similarity search | SearchResult |
+===============================================================================+
+===============================================================================+
| EXECUTE PHASE |
+===============================================================================+
| Package | Responsibility | Key Exports |
+----------------------+-------------------------+------------------------------+
| openadapt-evals | Benchmark evaluation, | BenchmarkAdapter, ApiAgent |
| | metrics collection | evaluate_agent_on_benchmark |
+----------------------+-------------------------+------------------------------+
| openadapt-grounding | UI element detection, | ElementDetector, SoMPrompt |
| | coordinate mapping | OmniParser, GeminiGrounder |
+===============================================================================+
+===============================================================================+
| CROSS-CUTTING |
+===============================================================================+
| Package | Responsibility | Key Exports |
+-------------------+----------------------------+------------------------------+
| openadapt-viewer | HTML visualization, | PageBuilder, HTMLBuilder |
| | trajectory replay | TrajectoryViewer |
+-------------------+----------------------------+------------------------------+
| openadapt | Unified CLI, | cli.main, lazy imports |
| (meta-package) | dependency coordination | |
+===============================================================================+
Package Dependency Matrix¶
| Package | capture | ml | evals | viewer | grounding | retrieval | privacy |
|---|---|---|---|---|---|---|---|
| openadapt-capture | - | - | - | - | - | - | O |
| openadapt-ml | R | - | - | - | O | O | - |
| openadapt-evals | - | R | - | O | O | O | - |
| openadapt-viewer | O | O | O | - | - | - | O |
| openadapt-grounding | - | - | - | - | - | - | - |
| openadapt-retrieval | R | - | - | - | - | - | - |
| openadapt-privacy | - | - | - | - | - | - | - |

Legend: R = Required, O = Optional, - = None
7. Feedback Loops¶
System-Level Feedback Architecture¶
DEMONSTRATE
|
| Human demonstrations
v
+-----------------------------> LEARN <----------------------------+
| | |
| | Trained policies |
| +-----------------------------|---------------------+ |
| | v | |
| | +-----------------> EXECUTE <--------------+ | |
| | | | | | |
| | | Retry on | Success/Failure | | |
| | | recoverable | outcomes | | |
| | | errors v | | |
| | | +-------+-------+ | | |
| | | | | | | |
| | +---------------+ EVALUATE +-----------+ | |
| | (Loop 1: Retry) | | | |
| | +-------+-------+ | |
| | | | |
| | | Execution traces | |
| | v | |
| | Demo library grows | |
| | | | |
| +---------------------------+ | |
| (Loop 2: Library Growth) | |
| | |
| Failure analysis identifies gaps | |
| | | |
| v | |
| Human correction | |
| | | |
+--------------------+ | |
(Loop 3: Human-in-Loop) | |
| |
Self-improvement loop | |
(execution traces -> training) | |
| | |
+------------------------+ |
(Loop 4: Self-Improvement) |
|
Benchmark-driven development |
(eval results -> architecture improvements) |
| |
+-----------------------------------+
(Loop 5: Benchmark-Driven)
Feedback Loop Details¶
| Loop | Name | Trigger | Outcome | Status |
|---|---|---|---|---|
| 1 | Retry | Recoverable error | Re-attempt action | Implemented |
| 2 | Library Growth | Successful execution | New demo added | Implemented |
| 3 | Human-in-Loop | Unrecoverable failure | Human correction -> demo | Implemented |
| 4 | Self-Improvement | Execution traces | Fine-tuning | Research |
| 5 | Benchmark-Driven | Eval metrics | Architecture changes | Active |
8. Implementation Status¶
What's Implemented vs Future Work¶
+==============================================================================+
| IMPLEMENTED (Solid) |
+==============================================================================+
| Component | Package | Notes |
+--------------------------+------------------+--------------------------------+
| Screen capture | capture | macOS, Windows, Linux |
| Event recording | capture | Mouse, keyboard, touch |
| Event aggregation | capture | Literal -> Symbolic |
| A11y tree capture | capture | macOS, Windows |
| Demo storage | capture | JSON/Parquet/PNG |
| Privacy scrubbing | privacy | Presidio, AWS Comprehend |
| Demo embedding | retrieval | CLIP, SigLIP |
| Vector indexing | retrieval | FAISS, Annoy |
| Similarity search | retrieval | Top-k retrieval |
| API model adapters | ml | Claude, GPT-4V, Gemini |
| Element detection | grounding | OmniParser, Florence2 |
| SoM annotation | grounding | Numbered element labels |
| WAA benchmark | evals | Full integration |
| Mock benchmark | evals | Testing infrastructure |
| HTML visualization | viewer | Trajectory replay |
| Unified CLI | openadapt | capture/train/eval/view |
+==============================================================================+
+==============================================================================+
| IN PROGRESS (Dashed) |
+==============================================================================+
| Component | Package | Notes |
+--------------------------+------------------+--------------------------------+
| Training pipeline | ml | Qwen-VL fine-tuning |
| LoRA adapters | ml | Parameter-efficient training |
| Template extraction | capture | Regex-based parameterization |
| WebArena benchmark | evals | Browser automation |
| Training dashboard | viewer | Loss/metrics visualization |
| Audio transcription | capture | Whisper integration |
+--------------------------+------------------+--------------------------------+
+==============================================================================+
| FUTURE WORK (Dotted) |
+==============================================================================+
| Component | Package | Notes |
+--------------------------+------------------+--------------------------------+
| Process mining | ml (future) | Semantic action discovery |
| Goal composition | ml (future) | High-level task planning |
| Self-improvement | ml (future) | Training on execution traces |
| OSWorld benchmark | evals | Cross-platform desktop |
| Multi-agent collaboration| ml (future) | Agent coordination |
| Active learning | ml (future) | Human feedback integration |
| Mobile platform | capture | iOS, Android |
| Safety gates | evals | Action validation layer |
+==============================================================================+
Abstraction Ladder Implementation Status¶
| Level | Name | Status | Implementation |
|---|---|---|---|
| 0 | Literal | Implemented | Raw event recording in capture |
| 1 | Symbolic | Implemented | Event aggregation in capture |
| 2 | Template | Partial | Regex extraction in capture |
| 3 | Semantic | Research | LLM intent recognition |
| 4 | Goal | Future | Process mining |
9. Architecture Evolution Diagrams¶
Era 1: Alpha Monolith (2023)¶
+=========================================================================+
| ALPHA ARCHITECTURE (2023) |
+=========================================================================+
| |
| +------------------------------------------------------------------+ |
| | openadapt (monolithic) | |
| +------------------------------------------------------------------+ |
| | | |
| | +-------------+ +-------------+ +-------------+ | |
| | | record | -> | visualize | -> | replay | | |
| | +-------------+ +-------------+ +-------------+ | |
| | | | | | |
| | v v v | |
| | +-------------+ +-------------+ +------------------+ | |
| | | models | | plotting | | strategies/ | | |
| | | - Recording | | - HTML gen | | - base.py | | |
| | | - ActionEvt | | | | - naive.py | | |
| | | - Screenshot| | | | - vanilla.py | | |
| | | - WindowEvt | | | | - visual.py | | |
| | +-------------+ +-------------+ +------------------+ | |
| | | | | |
| | v v | |
| | +-------------+ +---------------+ | |
| | | db/ | | adapters/ | | |
| | | - SQLite | | - anthropic | | |
| | | - CRUD ops | | - openai | | |
| | +-------------+ | - replicate | | |
| | +---------------+ | |
| +------------------------------------------------------------------+ |
| |
+=========================================================================+
Characteristics:
- Single repository, single package
- Tightly coupled components
- Strategy pattern for replay variants
- SQLite + Alembic migrations
- Prompts embedded in code
Era 2: Transition (2024)¶
+=========================================================================+
| TRANSITION ARCHITECTURE (2024) |
+=========================================================================+
| |
| Legacy codebase frozen -> /legacy/ |
| |
| New modular packages designed: |
| |
| +-------------+ +-------------+ +-------------+ +-------------+ |
| | capture | | ml | | evals | | viewer | |
| +-------------+ +-------------+ +-------------+ +-------------+ |
| | privacy | | retrieval | | grounding | |
| +-------------+ +-------------+ +-------------+ |
| |
| Key changes: |
| - Separate PyPI packages |
| - Lazy imports for optional deps |
| - Unified CLI in meta-package |
| - Policy/grounding separation |
| - Benchmark-first development |
| |
+=========================================================================+
Era 3: Modern Meta-Package (2025+)¶
+=========================================================================+
| MODERN ARCHITECTURE (2025+) |
+=========================================================================+
| |
| +------------------+ |
| | User Layer | |
| | CLI / Web UI | |
| +--------+---------+ |
| | |
| v |
| +------------------+ |
| | openadapt | |
| | (meta-package) | |
| +--------+---------+ |
| | |
| +------------------------+------------------------+ |
| | | | | | |
| v v v v v |
| +---------+ +---------+ +---------+ +---------+ +--------+ |
| | capture | | ml | | evals | | viewer | |optional| |
| +---------+ +---------+ +---------+ +---------+ +--------+ |
| | | | | | |
| v v v v v |
| +---------------------------------------------------------------+ |
| | Shared Interfaces | |
| | - Trajectory format (JSON/Parquet) | |
| | - Action space specification | |
| | - Observation schema | |
| | - Benchmark protocols | |
| +---------------------------------------------------------------+ |
| | |
| v |
| +---------------------------------------------------------------+ |
| | Model Layer | |
| | +----------+ +----------+ +----------+ +----------+ | |
| | | Claude | | GPT-4V | | Gemini | | Qwen-VL | | |
| | +----------+ +----------+ +----------+ +----------+ | |
| +---------------------------------------------------------------+ |
| |
+=========================================================================+
Full System Architecture (Mermaid)¶
flowchart TB
subgraph User["User Layer"]
CLI[openadapt CLI]
UI[Desktop/Web GUI]
end
subgraph Phase1["DEMONSTRATE"]
direction TB
REC[Recorder<br/>openadapt-capture]
SCRUB[Privacy Scrubber<br/>openadapt-privacy]
STORE[(Demo Library)]
REC --> SCRUB
SCRUB --> STORE
end
subgraph Phase2["LEARN"]
direction TB
subgraph Retrieval["Retrieval Path"]
EMB[Embedder]
IDX[Vector Index]
SEARCH[Similarity Search]
end
subgraph Training["Training Path"]
LOADER[Data Loader]
TRAINER[Model Trainer]
CKPT[(Checkpoints)]
end
subgraph Mining["Process Mining"]
ABSTRACT[Abstractor]
PATTERNS[Pattern Library]
end
end
subgraph Phase3["EXECUTE"]
direction TB
OBS[1. OBSERVE]
GROUND[2. GROUND<br/>openadapt-grounding]
PLAN[3. PLAN<br/>Demo-Conditioned]
ACT[4. ACT]
EVAL[5. EVALUATE<br/>openadapt-evals]
OBS --> GROUND
GROUND --> PLAN
PLAN --> ACT
ACT --> EVAL
EVAL -->|retry| OBS
end
subgraph Models["Model Layer"]
direction LR
CLAUDE[Claude]
GPT[GPT-4o]
GEMINI[Gemini]
QWEN[Qwen-VL]
end
subgraph Viewer["Cross-Cutting: Viewer"]
VIZ[Visualization]
REPLAY[Replay]
DASH[Dashboard]
end
%% User interactions
CLI --> REC
UI --> REC
CLI --> TRAINER
CLI --> EVAL
%% Demo flow
STORE --> EMB
STORE --> LOADER
STORE --> ABSTRACT
EMB --> IDX
IDX --> SEARCH
LOADER --> TRAINER
TRAINER --> CKPT
ABSTRACT --> PATTERNS
%% Execution flow (demo-conditioning)
SEARCH -->|demo context| PLAN
CKPT -->|policy| PLAN
PATTERNS -.->|templates| PLAN
%% Model connections
PLAN --> Models
GROUND --> Models
%% Viewer connections
STORE -.-> VIZ
CKPT -.-> DASH
EVAL -.-> DASH
%% Feedback loops
EVAL -->|success trace| STORE
EVAL -->|failure| User
%% Styling
classDef userLayer fill:#E74C3C,stroke:#A93226,color:#fff
classDef phase1 fill:#3498DB,stroke:#1A5276,color:#fff
classDef phase2 fill:#27AE60,stroke:#1E8449,color:#fff
classDef phase3 fill:#9B59B6,stroke:#6C3483,color:#fff
classDef models fill:#F39C12,stroke:#B7950B,color:#fff
classDef viewer fill:#1ABC9C,stroke:#148F77,color:#fff
classDef future fill:#95A5A6,stroke:#707B7C,color:#fff,stroke-dasharray: 5 5
class CLI,UI userLayer
class REC,SCRUB,STORE phase1
class EMB,IDX,SEARCH,LOADER,TRAINER,CKPT phase2
class ABSTRACT,PATTERNS future
class OBS,GROUND,PLAN,ACT,EVAL phase3
class CLAUDE,GPT,GEMINI,QWEN models
class VIZ,REPLAY,DASH viewer

Execution Loop Evolution¶
ALPHA: Strategy-Based MODERN: Policy/Grounding
================================ ================================
+------------------+ +------------------+
| BaseReplay | | OBSERVE |
| Strategy | | (Screenshot + |
| | | A11y tree) |
| while True: | +--------+---------+
| screenshot = | |
| take() | v
| action = | +------------------+
| get_next() | ------> | GROUND |
| play(action) | | (Element detect |
| | | + SoM annotate)|
+------------------+ +--------+---------+
|
v
+------------------+
| PLAN |
| (VLM reasoning |
| + demo context)|
+--------+---------+
|
v
+------------------+
| ACT |
| (Input synth + |
| safety check) |
+--------+---------+
|
v
+------------------+
| EVALUATE |
| (Success check |
| + feedback) |
+------------------+
Package Responsibility Diagram¶
flowchart LR
subgraph demonstrate["DEMONSTRATE Phase"]
CAP[openadapt-capture]
PRV[openadapt-privacy]
end
subgraph learn["LEARN Phase"]
RET[openadapt-retrieval]
ML[openadapt-ml]
end
subgraph execute["EXECUTE Phase"]
GRD[openadapt-grounding]
EVL[openadapt-evals]
end
subgraph crosscut["Cross-Cutting"]
VWR[openadapt-viewer]
end
subgraph meta["Meta-Package"]
OA[openadapt CLI]
end
%% CLI orchestrates all
OA --> CAP
OA --> ML
OA --> EVL
OA --> VWR
%% Phase 1 flow
CAP -->|raw data| PRV
PRV -->|scrubbed data| ML
PRV -->|scrubbed data| RET
%% Phase 2 flow
RET -->|embeddings| EVL
ML -->|checkpoints| EVL
%% Phase 3 flow
GRD -->|coordinates| EVL
GRD -->|coordinates| ML
%% Viewer integration
CAP -.->|demos| VWR
ML -.->|training| VWR
EVL -.->|results| VWR
%% Styling
classDef phase1 fill:#3498DB,stroke:#1A5276,color:#fff
classDef phase2 fill:#27AE60,stroke:#1E8449,color:#fff
classDef phase3 fill:#9B59B6,stroke:#6C3483,color:#fff
classDef cross fill:#1ABC9C,stroke:#148F77,color:#fff
classDef meta fill:#2C3E50,stroke:#1A252F,color:#fff
class CAP,PRV phase1
class RET,ML phase2
class GRD,EVL phase3
class VWR cross
class OA meta

Feedback Loop Diagram¶
flowchart TB
subgraph Loop1["Loop 1: Demo Library Growth"]
D1[Human Demo] --> L1[Learn]
L1 --> E1[Execute]
E1 --> |success| S1[Store as Demo]
E1 --> |failure| F1[Failure Analysis]
F1 --> |new demo| D1
S1 --> L1
end
subgraph Loop2["Loop 2: Self-Improvement"]
E2[Execute] --> T2[Trace]
T2 --> |success trace| FT2[Fine-tune]
FT2 --> E2
end
subgraph Loop3["Loop 3: Benchmark-Driven"]
B3[Benchmark] --> M3[Metrics]
M3 --> A3[Architecture Improvement]
A3 --> B3
end
Loop1 -.->|traces| Loop2
Loop2 -.->|models| Loop1
Loop1 -.->|metrics| Loop3
Loop3 -.->|improvements| Loop1
%% Styling
classDef loop1 fill:#3498DB,stroke:#1A5276,color:#fff
classDef loop2 fill:#27AE60,stroke:#1E8449,color:#fff,stroke-dasharray: 5 5
classDef loop3 fill:#9B59B6,stroke:#6C3483,color:#fff
class D1,L1,E1,S1,F1 loop1
class E2,T2,FT2 loop2
class B3,M3,A3 loop3

10. Future Directions¶
Near-Term (Q1 2026)¶
| Priority | Goal | Package | Status |
|---|---|---|---|
| P0 | PyPI releases for all packages | all | In progress |
| P0 | WAA baseline metrics established | evals | Pending |
| P1 | Fine-tuning pipeline validated | ml | In progress |
| P1 | Demo conditioning in evals | evals + retrieval | Pending |
| P2 | docs.openadapt.ai launched | docs | Pending |
Medium-Term (2026)¶
| Goal | Description |
|---|---|
| Process Mining | Automatic extraction of semantic actions from demos |
| Self-Improvement | Training on successful execution traces |
| Multi-Benchmark | WebArena + OSWorld integration |
| Enterprise Deployment | Production deployment guides |
Long-Term (2026+)¶
| Goal | Description |
|---|---|
| Cross-App Transfer | Demos from Excel help with Google Sheets |
| Multi-Agent | Coordinated agents for complex workflows |
| Active Learning | Agents request human help strategically |
| Mobile Platforms | iOS and Android capture/replay |
Research Questions¶
- Abstraction Discovery: Can we automatically extract semantic actions from literal event sequences?
- Transfer Learning: How much does demo conditioning help across different applications?
- Explanation: How do we make agent decisions interpretable to users?
- Safety: What guardrails prevent harmful autonomous actions?
Appendix A: Glossary¶
| Term | Definition |
|---|---|
| A11y Tree | Accessibility tree - structured UI element representation |
| Demo | Recorded human demonstration (trajectory) |
| Grounding | Mapping text/intent to specific UI coordinates |
| LoRA | Low-Rank Adaptation - efficient fine-tuning method |
| Policy | Decision function mapping observations to actions |
| SoM | Set-of-Mark - visual grounding via numbered labels |
| Trajectory | Sequence of (observation, action) pairs |
| VLM | Vision-Language Model |
| WAA | Windows Agent Arena benchmark |
Appendix B: Related Documents¶
- Architecture Overview - Package structure and data flow
- Roadmap Priorities - Current development priorities
- Package Documentation - Individual package guides
- Legacy Freeze - Migration from monolith
Appendix C: Version History¶
| Version | Date | Changes |
|---|---|---|
| 3.0 | Jan 2026 | Alpha vision synthesis, evolution diagrams, SOTA alignment |
| 2.0 | Jan 2026 | Comprehensive redesign, modular architecture |
| 1.0 | Dec 2025 | Initial modular architecture |
| 0.x | 2023-2024 | Legacy monolithic design |
This document is maintained as part of the OpenAdapt project. For updates, see the GitHub repository.