OpenAdapt Publication Roadmap: A Critical Assessment¶
Version: 2.0 | Date: January 2026 | Status: Honest Evaluation | Author: OpenAdapt Research Team
Preamble: Intellectual Honesty¶
This document is written from the perspective of a skeptical reviewer at a top venue. The goal is not to inflate claims but to identify what is genuinely publishable, what experiments are actually needed, and what timeline is realistic given current resources.
Guiding principle: Better to publish a solid workshop paper than to submit an overreaching main track paper that gets rejected.
Table of Contents¶
- Current State of Evidence
- Honest Contribution Assessment
- Weakness Analysis
- Required Experiments for Defensible Claims
- Statistical Rigor Requirements
- Related Work Gap Analysis
- Venue Fit Analysis
- Realistic Timeline
- Risk Mitigation
- Action Items
- Path to Main Track Publication (Parallel Track)
1. Current State of Evidence¶
1.1 What We Actually Have¶
| Experiment | n | Result | Statistical Validity | Benchmark |
|---|---|---|---|---|
| macOS demo-conditioning (first-action) | 45 | 46.7% -> 100% | Moderate (single model, single platform) | Non-standard |
| WAA baseline (interrupted) | 8 | 12.5% success | Weak (incomplete, agent bugs) | Standard |
| Length-matched control | 45 | 57.8% | Useful (rules out token length) | Non-standard |
1.2 Critical Assessment of Current Results¶
The 100% first-action accuracy claim:

- Scope: All 45 tasks share the SAME correct first action (click the Apple menu)
- Implication: This measures whether a demo can transfer procedural entry points, NOT general task-solving
- Limitation: Not comparable to any published benchmark
- Honest framing: "Demo-conditioning eliminates spatial bias in navigation initialization"
The WAA baseline:

- Status: 1 of 8 tasks passed (12.5%)
- Problem: The run was interrupted, and the agent had bugs unrelated to our method
- Implication: We do not yet have a clean zero-shot baseline on a standard benchmark
1.3 What We Do NOT Have¶
- Standard benchmark results - No complete WAA, WebArena, or OSWorld evaluation
- Multi-model comparison - Only Claude Sonnet 4.5 tested
- Episode success rate - Only first-action accuracy measured
- Statistical significance tests - No p-values, confidence intervals, or effect sizes
- Ablation studies - No systematic ablation of demo components
- Retrieval experiments - Retrieval system not evaluated
- User studies - No human evaluation of system usability
2. Honest Contribution Assessment¶
2.1 What Is ACTUALLY Novel?¶
| Claimed Contribution | Novelty Assessment | Prior Work |
|---|---|---|
| Demo-conditioned GUI agents | Moderate - PbD is old; VLM+demo is emerging | UINav (2023), SUGILITE (2017) |
| "Show don't tell" paradigm | Low - Standard few-shot prompting | GPT-3 (2020), chain-of-thought |
| Multimodal demo retrieval | Moderate - Novel application to GUI domain | RAG literature extensive |
| Modular architecture | Low - Engineering contribution | Many open-source frameworks |
| Cross-platform support | Low - Engineering contribution | SeeAct, UFO also support multiple platforms |
2.2 Defensible Novel Claims¶
After honest assessment, the defensible novel contribution is:
Demonstration-conditioned prompting for VLM-based GUI agents: We show that providing a human demonstration in the VLM prompt substantially improves action selection accuracy compared to instruction-only prompting. This is a prompting strategy, not a new model architecture or training method.
This is NOT:

- A new model architecture
- A training/fine-tuning method
- A new benchmark
- A theoretical contribution
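To make the scope concrete, the sketch below shows what demonstration-conditioned prompting amounts to in practice: a recorded demo is serialized as text and prepended to the task instruction before the current screenshot is attached. The step fields, helper names, and wording are illustrative assumptions, not OpenAdapt's actual prompt builder.

```python
# Illustrative sketch only; field names ('action', 'result') and phrasing are assumed,
# not taken from the OpenAdapt codebase.

def serialize_demo(steps):
    """Render a recorded demonstration as a numbered action/result trace (behavior-only format)."""
    lines = ["Here is a demonstration of a similar task:"]
    for i, step in enumerate(steps, start=1):
        lines.append(f"{i}. Action: {step['action']} -> Result: {step['result']}")
    return "\n".join(lines)

def build_prompt(task_instruction, demo_steps=None):
    """Instruction-only prompt when demo_steps is None; demonstration-conditioned otherwise."""
    parts = []
    if demo_steps:
        parts.append(serialize_demo(demo_steps))
    parts.append(f"Task: {task_instruction}")
    parts.append("Given the current screenshot, output the next UI action.")
    return "\n\n".join(parts)
```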
2.3 Contribution Positioning¶
Honest positioning: This is an empirical study showing that a simple prompting intervention (including demonstrations) improves GUI agent performance. The contribution is:
- Empirical finding: Demonstrations help, and we quantify by how much
- Analysis: We explain WHY (spatial bias, procedural priors)
- Practical method: We provide an open-source implementation
What reviewers will say: "This is straightforward few-shot prompting applied to GUI agents. What is technically novel?"
Our response must be: "The contribution is empirical, not algorithmic. We systematically evaluate demo-conditioning across N tasks and M models, providing the first rigorous study of this prompting strategy for GUI automation."
3. Weakness Analysis¶
3.1 Anticipated Reviewer Criticisms¶
| Criticism | Severity | Our Current Status | Mitigation |
|---|---|---|---|
| "All tasks share the same first action" | Critical | True - intentional design | Expand to diverse first actions |
| "Only one model tested" | High | True | Add GPT-4V, Gemini |
| "Non-standard benchmark" | High | True | Complete WAA evaluation |
| "No episode success rate" | High | True | Run multi-step evaluation |
| "Small sample size" | Medium | n=45 is reasonable | Add more tasks |
| "No statistical tests" | Medium | True | Add McNemar's test, bootstrap CI |
| "Limited to English/macOS" | Medium | True | Acknowledge as limitation |
| "Retrieval system not evaluated" | Medium | True | Either evaluate or remove claims |
| "No comparison to fine-tuning" | Medium | True | Acknowledge; position as prompt-only |
| "Engineering contribution, not research" | Low | Partially true | Emphasize empirical findings |
3.2 Weaknesses We CANNOT Fix Before Submission¶
- Fundamental novelty - Demo-conditioning is not architecturally novel
- Benchmark saturation - If WAA shows <20% improvement, contribution weakens
- Single-domain focus - GUI automation is narrow; no multi-domain transfer
3.3 Weaknesses We CAN Fix¶
- Benchmark coverage - Run complete WAA evaluation (1-2 weeks)
- Multi-model comparison - Add GPT-4V, Gemini (1 week)
- Statistical rigor - Add proper tests (1-2 days)
- Diverse first actions - Design new task set (1 week)
- Episode success - Extend evaluation (1 week)
4. Required Experiments for Defensible Claims¶
4.1 Minimum Viable Experiments (for Workshop Paper)¶
| Experiment | Tasks | Models | Trials/Task | Total Runs | Effort |
|---|---|---|---|---|---|
| WAA zero-shot baseline | 20 | 2 | 3 | 120 | 1 week |
| WAA demo-conditioned | 20 | 2 | 3 | 120 | 1 week |
| Total | 20 | 2 | 6 | 240 | 2 weeks |
Why 3 trials per task?

- GUI runs are stochastic (model sampling, UI timing)
- Enables variance estimation and significance testing
- Standard practice in the agent evaluation literature
4.2 Full Conference Paper Requirements¶
| Experiment | Tasks | Models | Trials | Total Runs | Effort |
|---|---|---|---|---|---|
| WAA evaluation | 50+ | 3 | 3 | 450+ | 3 weeks |
| WebArena evaluation | 100+ | 2 | 3 | 600+ | 4 weeks |
| Ablation: demo format | 20 | 1 | 3 | 60 | 1 week |
| Ablation: demo length | 20 | 1 | 3 | 60 | 1 week |
| Ablation: # demos (k=1,3,5) | 20 | 1 | 3 | 180 | 2 weeks |
| Cross-task transfer | 20 | 1 | 3 | 60 | 1 week |
| Total | ~230 | 3-5 | 3+ | ~1500 | 10-12 weeks |
4.3 Essential Ablations¶
- Demo format ablation
  - Full trace (screenshot descriptions + actions + results)
  - Behavior-only (actions + results)
  - Action-only (just the action sequence)
- Demo relevance ablation
  - Exact-match demo (same task)
  - Same-domain demo (e.g., any Settings task)
  - Cross-domain demo (e.g., a Browser demo for a Settings task)
  - Random demo
- Number of demos (k)
  - k = 1, 3, 5
  - Do more demos help, or just add noise?
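A small sketch of how this ablation grid could be enumerated as run configurations; the condition names mirror the list above, and the values are ours for illustration, not an existing OpenAdapt config.

```python
# Enumerate the Section 4.3 ablation grid; each config is then run over the task set
# (e.g., 20 tasks x 3 trials). Names and values are illustrative.
from itertools import product

DEMO_FORMATS = ["full_trace", "behavior_only", "action_only"]
DEMO_RELEVANCE = ["exact_match", "same_domain", "cross_domain", "random"]
NUM_DEMOS = [1, 3, 5]

def ablation_configs():
    for fmt, rel, k in product(DEMO_FORMATS, DEMO_RELEVANCE, NUM_DEMOS):
        yield {"demo_format": fmt, "demo_relevance": rel, "num_demos": k}

print(sum(1 for _ in ablation_configs()))  # 3 * 4 * 3 = 36 conditions
```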
4.4 Baselines We MUST Compare Against¶
| Baseline | Description | Why Essential |
|---|---|---|
| Zero-shot instruction only | No demo, just task description | Primary comparison |
| Zero-shot + CoT | "Think step by step" | Fair comparison to prompting methods |
| Few-shot examples (text) | Text-only examples, no screenshots | Isolate visual contribution |
| SOTA on WAA | GPT-5.1 + OmniParser (~19.5%) | Establish relative performance |
| Random policy | Random clicks | Sanity check |
5. Statistical Rigor Requirements¶
5.1 Required Statistical Tests¶
| Test | Purpose | When to Use |
|---|---|---|
| McNemar's test | Paired comparison of binary outcomes | Zero-shot vs demo on same tasks |
| Bootstrap confidence intervals | Uncertainty estimation | All accuracy metrics |
| Effect size (Cohen's h) | Practical significance | Accompany p-values |
| Bonferroni correction | Multiple comparisons | When testing multiple models/conditions |
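The tests above can be computed with standard libraries. A minimal sketch, assuming per-task binary outcomes aligned between the two conditions (statsmodels for McNemar's test; the bootstrap CI and Cohen's h computed by hand):

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def paired_stats(zero_shot, demo, n_boot=10_000, seed=0):
    """zero_shot, demo: binary arrays (1 = task solved), aligned by task."""
    zero_shot, demo = np.asarray(zero_shot), np.asarray(demo)

    # McNemar's test on the 2x2 table of paired outcomes
    table = [
        [np.sum((zero_shot == 1) & (demo == 1)), np.sum((zero_shot == 1) & (demo == 0))],
        [np.sum((zero_shot == 0) & (demo == 1)), np.sum((zero_shot == 0) & (demo == 0))],
    ]
    test = mcnemar(table, exact=True)

    # Percentile bootstrap CI for demo-conditioned accuracy
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(demo), size=(n_boot, len(demo)))
    ci = np.percentile(demo[idx].mean(axis=1), [2.5, 97.5])

    # Cohen's h for the accuracy difference
    p1, p2 = zero_shot.mean(), demo.mean()
    cohens_h = 2 * np.arcsin(np.sqrt(p2)) - 2 * np.arcsin(np.sqrt(p1))

    return {"p_value": test.pvalue, "ci_95": ci.tolist(), "cohens_h": cohens_h}
```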
5.2 Minimum Sample Sizes¶
For detecting a 20 percentage point improvement with 80% power (alpha = 0.05):

- Per-condition: n >= 39 tasks (we have 45, sufficient)
- With 3 trials per task: 39 x 3 = 117 total observations

For detecting a 10 percentage point improvement:

- Per-condition: n >= 199 tasks (we do NOT have this)
- Implication: If the effect is smaller than expected, we may be underpowered
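For reference, a minimal power-calculation sketch using statsmodels. It assumes an unpaired two-proportion test, a one-sided alternative, and a 40% baseline rate; the required n shifts with the assumed baseline, test sidedness, and a paired (McNemar) design, so it only roughly reproduces the figures above.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Cohen's h for a 20pp and a 10pp improvement over an assumed 40% baseline
h_20pp = proportion_effectsize(0.60, 0.40)
h_10pp = proportion_effectsize(0.50, 0.40)

analysis = NormalIndPower()
n_20pp = analysis.solve_power(effect_size=h_20pp, alpha=0.05, power=0.80, alternative="larger")
n_10pp = analysis.solve_power(effect_size=h_10pp, alpha=0.05, power=0.80, alternative="larger")
print(f"n per condition: ~{n_20pp:.0f} for 20pp, ~{n_10pp:.0f} for 10pp")
```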
5.3 Reporting Standards¶
Every result table must include:

1. Mean accuracy
2. Standard deviation (across trials)
3. 95% confidence interval
4. Sample size (n)
5. Statistical test and p-value for key comparisons
Example:
| Condition | Accuracy | 95% CI | p-value (vs zero-shot) |
|-----------|----------|--------|------------------------|
| Zero-shot | 33.3% | [22.1, 46.0] | - |
| Demo-conditioned | 68.9% | [55.7, 80.1] | p<0.001 (McNemar) |
6. Related Work Gap Analysis¶
6.1 Papers We MUST Cite¶
GUI Agents & Benchmarks:

1. Bonatti et al. (2024) - Windows Agent Arena
2. Zhou et al. (2023) - WebArena
3. Xie et al. (2024) - OSWorld
4. Cheng et al. (2024) - SeeClick
5. Kim et al. (2024) - Crab benchmark
6. Gur et al. (2024) - WebAgent

VLM-based Agents:

7. Wang et al. (2024) - Mobile-Agent
8. Zhang et al. (2024) - UFO
9. Lu et al. (2024) - WebVoyager
10. Anthropic (2024) - Claude Computer Use

Programming by Demonstration:

11. Li et al. (2023) - UINav
12. Li et al. (2017) - SUGILITE
13. Cypher et al. (1993) - Watch What I Do (foundational PbD text)

Visual Grounding:

14. Chen et al. (2024) - OmniParser
15. Yang et al. (2023) - Set-of-Marks

Few-shot Prompting & RAG:

16. Brown et al. (2020) - GPT-3 few-shot
17. Wei et al. (2022) - Chain-of-thought
18. Lewis et al. (2020) - RAG
6.2 Potential Reviewers¶
Based on related work, likely reviewers include researchers from:

- Microsoft Research (WAA, UFO, OmniParser teams)
- Google DeepMind (WebAgent, PaLM teams)
- CMU HCII (SUGILITE, UINav teams)
- Allen Institute for AI (general VLM agents)
- Stanford HAI (human-AI interaction)
Implication: Paper must respectfully position against UFO, SeeClick, and other Microsoft/Google work.
6.3 How We Differ From Prior Work¶
| Prior Work | Their Approach | Our Difference |
|---|---|---|
| UINav | Referee model for demo quality | We don't evaluate demo quality |
| SUGILITE | NL + GUI disambiguation | We use full VLM reasoning |
| UFO | Dual-agent architecture | We use single VLM with demo context |
| WebVoyager | Web-specific agent | We target desktop applications |
| Claude Computer Use | Production agent, no demos | We add demo conditioning |
Honest assessment: The difference from Claude Computer Use is simply "add a demo to the prompt." This is the core contribution, and we must own it.
7. Venue Fit Analysis¶
7.1 Realistic Venue Assessment¶
| Venue | Fit | Honest Chance | Rationale |
|---|---|---|---|
| NeurIPS main track | Poor | <20% | Contribution too incremental for main track |
| NeurIPS Datasets & Benchmarks | Poor | N/A | We don't propose a new benchmark |
| ICML main track | Poor | <20% | Same as NeurIPS |
| ICLR main track | Poor | <20% | Needs stronger learning contribution |
| CHI main track | Moderate | 30-40% | Good fit IF we add user study |
| UIST main track | Good | 40-50% | Systems + empirical evaluation |
| ACL/EMNLP | Poor | <20% | Not sufficiently NLP-focused |
| AAAI | Moderate | 30-40% | More accepting of applied work |
| LLM Agents Workshop (NeurIPS) | Excellent | 60-70% | Perfect scope and contribution level |
| CHI Late-Breaking Work | Excellent | 70%+ | Low barrier, good fit |
| UIST Demo Track | Excellent | 60-70% | Live demo is compelling |
7.2 Recommended Strategy¶
Phase 1 (Immediate): Target the LLM Agents Workshop @ NeurIPS 2026 or ICML 2026

- Deadline: ~3 months before the conference
- Page limit: 4-8 pages
- Contribution bar: Lower than main track
- Allows us to establish priority and get feedback

Phase 2 (If the workshop goes well): Expand to CHI 2027 or UIST 2026

- Add a user study (n = 20-30)
- Expand benchmark coverage
- 10-page full paper

Phase 3 (Long shot): Only pursue the NeurIPS/ICML main track IF:

- WAA shows a >30pp improvement over SOTA
- We discover unexpected insights during analysis
- Workshop reviewers suggest main-track potential
7.3 Venue-Specific Requirements¶
For CHI acceptance:

- User study with statistical analysis (n >= 20)
- Qualitative analysis (interviews, think-aloud)
- Discussion of implications for HCI
- Ethical considerations

For workshop acceptance:

- Clear empirical contribution
- Reproducible experiments
- Honest limitations discussion
- Interesting future directions
8. Realistic Timeline¶
8.1 Minimum Viable Timeline (Workshop Paper)¶
| Week | Tasks | Dependencies |
|---|---|---|
| 1-2 | Fix WAA environment, run clean baseline | VM stable |
| 3-4 | Run demo-conditioned WAA experiments | Baseline done |
| 5 | Statistical analysis, write results | Experiments done |
| 6 | Write introduction, related work | - |
| 7 | Internal review, revisions | Draft done |
| 8 | Submit to workshop | - |
Total: 8 weeks from today to submission-ready
8.2 Realistic Timeline (CHI Full Paper)¶
| Month | Tasks |
|---|---|
| 1-2 | Complete WAA + WebArena experiments |
| 3 | Design and run user study |
| 4 | Analyze user study, write draft |
| 5 | Internal review, revisions |
| 6 | Submit to CHI |
Total: 6 months (CHI 2027 deadline: ~September 2026)
8.3 Timeline Risks¶
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| WAA environment issues | High | 2-3 week delay | Have backup mock evaluation |
| Results don't match expectations | Medium | May kill paper | Pivot to analysis/negative results |
| API rate limits/costs | Medium | 1-2 week delay | Budget API costs upfront |
| Co-author availability | Medium | Variable | Start writing in parallel |
9. Risk Mitigation¶
9.1 If WAA Results Are Disappointing¶
Scenario: Demo-conditioning shows <10pp improvement on WAA
Options:

1. Pivot to an analysis paper: Why doesn't demo-conditioning help on WAA?
2. Focus on narrow success cases: Which task categories benefit most?
3. Negative results paper: "When Demonstrations Don't Help"
4. Workshop-only publication: Present findings, get feedback
9.2 If Experiments Take Too Long¶
Scenario: Cannot complete experiments before deadline
Options:

1. Reduce scope: Fewer tasks, fewer models, one benchmark
2. Workshop paper first: Lower bar, establishes priority
3. arXiv preprint: Stake a claim while continuing experiments
4. Target a later deadline: Better to submit complete work
9.3 If Reviewers Reject on Novelty¶
Mitigation in paper:

- Explicitly position as an empirical study, not an algorithmic contribution
- Emphasize the magnitude of improvement and the practical value
- Provide extensive ablations to show what matters
- Open-source all code and data
10. Action Items¶
10.1 Immediate (This Week)¶
- Fix WAA environment - Resolve Navi agent bugs or switch to API agent
- Define exact task set - Select 20+ WAA tasks with diverse first actions
- Budget API costs - Estimate cost for 500+ API calls
10.2 Short-Term (Weeks 2-4)¶
- Run zero-shot baseline - 20 tasks x 2 models x 3 trials
- Write demos for all tasks - Using behavior-only format
- Run demo-conditioned evaluation - Same tasks, with demos
- Statistical analysis - McNemar's test, bootstrap CIs
10.3 Medium-Term (Weeks 5-8)¶
- Write workshop paper - 4-6 pages, focus on empirical results
- Create figures - Accuracy comparison, demo format examples
- Internal review - Get feedback from 2-3 people
- Submit to workshop - LLM Agents Workshop or similar
10.4 Long-Term (Months 3-6)¶
- Expand to WebArena - Additional benchmark coverage
- User study design - For CHI/UIST submission
- Run user study - n=20-30 participants
- Write full paper - 10 pages for CHI/UIST
11. Path to Main Track Publication (Parallel Track)¶
This section provides a rigorous assessment of what would be required to publish in a main track venue (NeurIPS, ICML, ICLR) rather than a workshop. This is a parallel track that requires substantially more investment.
11.1 Honest Assessment: Why Current Work is Workshop-Level¶
Our current contribution is fundamentally prompt engineering, not machine learning research. While valuable for practitioners, this positions us poorly for ML venues that expect learned components, theoretical insights, or architectural innovations.
Table: Anticipated Reviewer Concerns for Main Track Submission
| Concern | Severity | Our Current Status | What Main Track Requires |
|---|---|---|---|
| No learned component | Critical | True - retrieval uses heuristic similarity | Train retrieval end-to-end for downstream task |
| Single demo format | High | True - behavior-only format hardcoded | Learn optimal format/compression |
| Heuristic retrieval (BM25/embedding) | High | True - not optimized for action accuracy | Retrieval that optimizes task success, not similarity |
| Limited evaluation | High | 45 tasks, 1 model, 1 platform | 200+ tasks, 3+ models, 2+ benchmarks |
| No comparison to fine-tuning | High | True | Show when prompting beats/complements fine-tuning |
| No theoretical analysis | Medium | True - purely empirical | Information-theoretic or PAC-learning analysis |
| Engineering focus | Medium | True - system building, not research | Clear algorithmic or theoretical contribution |
| No ablation of demo components | Medium | Partial | Systematic ablation with significance tests |
Bottom line: A main track reviewer at NeurIPS/ICML will likely say: "This is a well-executed engineering project with an empirical evaluation, but where is the research contribution? Adding demos to prompts is not novel."
11.2 Required Technical Contributions (Options to Elevate)¶
To elevate from workshop to main track, we need at least ONE of the following technical contributions:
Option A: Learned Demo Retrieval (RECOMMENDED)¶
Effort: 2-3 months | Risk: Medium | Novelty: High
Core idea: Train the retrieval system to optimize action accuracy, not semantic similarity.
Why this works: Current retrieval uses off-the-shelf embeddings (CLIP, text similarity) that optimize for semantic match. But the best demo for a task may not be the most semantically similar - it may be one that provides the right procedural template or spatial priors.
Technical approach:

1. Collect retrieval training data: (query, demo, action_accuracy) tuples
2. Train a retrieval scorer to predict action accuracy given a (query, demo) pair
3. Use contrastive learning: demos that help should score higher than demos that don't (see the sketch below)
4. Evaluate: Does learned retrieval outperform heuristic retrieval?
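A minimal sketch of step 3: a contrastive scorer trained so that demos whose inclusion produced a correct action outrank demos that did not. The embedding dimensions, projection heads, and (positive, negative) pairing scheme are assumptions, not an existing implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RetrievalScorer(nn.Module):
    """Scores (query, demo) pairs from precomputed (frozen) embeddings."""
    def __init__(self, dim=768, hidden=256):
        super().__init__()
        self.query_proj = nn.Linear(dim, hidden)
        self.demo_proj = nn.Linear(dim, hidden)

    def forward(self, query_emb, demo_emb):
        q = F.normalize(self.query_proj(query_emb), dim=-1)
        d = F.normalize(self.demo_proj(demo_emb), dim=-1)
        return (q * d).sum(-1)  # cosine-style score

def info_nce_loss(scorer, query_emb, pos_demo_emb, neg_demo_embs, temperature=0.07):
    """Positive = demo whose inclusion led to a correct action; negatives did not."""
    pos = scorer(query_emb, pos_demo_emb).unsqueeze(-1)       # (B, 1)
    neg = scorer(query_emb.unsqueeze(1), neg_demo_embs)       # (B, K) via broadcasting
    logits = torch.cat([pos, neg], dim=-1) / temperature      # (B, 1 + K)
    labels = torch.zeros(logits.size(0), dtype=torch.long)    # positive is index 0
    return F.cross_entropy(logits, labels)
```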
Key experiments:

- Retrieval recall@k vs action accuracy correlation
- Learned vs heuristic retrieval on held-out tasks
- Analysis of what the model learns (which demo features matter?)

Related work to cite:

- REALM (Guu et al., 2020) - Retrieval-augmented language model pretraining
- Atlas (Izacard et al., 2022) - Few-shot learning with retrieval
- DocPrompting (Zhou et al., 2022) - Retrieve docs for code generation
Why reviewers would accept: "First demonstration that learned retrieval improves demo-conditioned GUI agents, with analysis of what retrieval features matter."
Option B: Learned Prompt Synthesis¶
Effort: 3-4 months | Risk: Medium-High | Novelty: High
Core idea: Learn to synthesize optimal demo prompts rather than using fixed templates.
Technical approach:

1. Define the prompt template space (what to include, how to format, compression level)
2. Use LLM-in-the-loop optimization (APE-style) to find optimal templates
3. Alternatively, train a small model to select/compress demo content
4. Evaluate: Does learned synthesis outperform hand-crafted templates?
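A simplified, non-LLM stand-in for steps 1-2: enumerate a small template space and greedily keep the variant with the best dev-set action accuracy. `render_demo`, `run_agent`, and `dev_tasks` are hypothetical placeholders; a real APE-style loop would propose templates with an LLM rather than enumerate them.

```python
from itertools import product

def candidate_templates():
    """Template space: which demo fields to include, and how aggressively to compress them."""
    field_sets = [("actions",), ("actions", "results"), ("screenshots", "actions", "results")]
    for fields, compression in product(field_sets, ["none", "summarized"]):
        yield {"fields": fields, "compression": compression}

def select_template(dev_tasks, run_agent, render_demo):
    """Greedy search: keep the template with the highest dev-set action accuracy."""
    best, best_acc = None, -1.0
    for template in candidate_templates():
        correct = sum(
            run_agent(task, demo_prompt=render_demo(task.demo, template))
            for task in dev_tasks
        )
        acc = correct / len(dev_tasks)
        if acc > best_acc:
            best, best_acc = template, acc
    return best, best_acc
```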
Key experiments:

- Template ablation with learned selection
- Compression ratio vs accuracy tradeoff
- Cross-task transfer of learned templates

Related work to cite:

- APE (Zhou et al., 2022) - Automatic prompt engineering
- DSPy (Khattab et al., 2023) - Programmatic prompt optimization
- PromptBreeder (Fernando et al., 2023) - Self-referential prompt evolution
Why reviewers would accept: "Novel prompt synthesis method that learns to format demonstrations for maximal downstream utility."
Option C: Behavioral Cloning with Demo-Augmentation¶
Effort: 4-6 months | Risk: High | Novelty: Very High
Core idea: Fine-tune a VLM using demonstration-augmented behavioral cloning.
Technical approach:

1. Collect a behavioral cloning dataset: (screenshot, task, action) tuples
2. Augment each example with retrieved demonstration context
3. Fine-tune the VLM with the demo in context vs without
4. Compare: Does demo-augmented fine-tuning outperform standard fine-tuning?
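A minimal sketch of steps 1-2: building demo-augmented fine-tuning examples from logged steps. The record layout and the `retrieve_demo` helper are assumptions for illustration, not part of an existing pipeline.

```python
from dataclasses import dataclass

@dataclass
class BCExample:
    task: str
    screenshot_path: str
    target_action: str   # e.g. "click(x=512, y=88)"
    demo_context: str    # serialized retrieved demonstration, or "" for the baseline

def build_examples(raw_steps, retrieve_demo, with_demo=True):
    """Turn logged (screenshot, task, action) steps into fine-tuning examples,
    optionally prepending a retrieved demonstration as extra context."""
    examples = []
    for step in raw_steps:
        demo = retrieve_demo(step["task"]) if with_demo else ""
        examples.append(BCExample(
            task=step["task"],
            screenshot_path=step["screenshot"],
            target_action=step["action"],
            demo_context=demo,
        ))
    return examples
```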
Key experiments:

- Fine-tuning with/without demo augmentation
- Sample efficiency: Do demos reduce the required training data?
- Analysis of attention patterns: Does the model attend to demos?

Related work to cite:

- CogAgent (Hong et al., 2023) - GUI agent fine-tuning
- SeeClick (Cheng et al., 2024) - Visual grounding for GUI
- RT-2 (Brohan et al., 2023) - Vision-language-action models
Why reviewers would accept: "First demonstration that demo-augmentation improves fine-tuned GUI agents, with analysis of when prompting vs fine-tuning is preferred."
Caveat: This requires significant compute ($2-5k GPU, 4-6 weeks training) and expertise in VLM fine-tuning.
Option D: Theoretical Analysis¶
Effort: 2-3 months | Risk: High | Novelty: Medium
Core idea: Provide theoretical analysis of why demonstrations help GUI agents.
Technical approach:

1. Information-theoretic analysis: How much information do demos provide?
2. PAC-learning analysis: Sample complexity with/without demos
3. Formal model of the GUI task space and demo utility
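One illustrative way such a formalization might begin (our own notation, purely a sketch, not an established result): define a demo's utility as the expected gain in action accuracy, or information-theoretically as the conditional mutual information between the correct action and the demo given the screen and task.

```latex
% s = screenshot, t = task, d = demonstration, a^* = correct action, \hat{a} = predicted action
\Delta(d) = \mathbb{E}_{(s,t)}\!\left[\Pr(\hat{a} = a^* \mid s, t, d) - \Pr(\hat{a} = a^* \mid s, t)\right],
\qquad
U(D) = I\!\left(A^{*}; D \mid S, T\right)
```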
Key contributions:

- Theoretical bound on demo utility
- Characterization of when demos help vs hurt
- Connection to few-shot learning theory

Related work to cite:

- Brown et al. (2020) - GPT-3 few-shot capabilities
- Xie et al. (2021) - Why in-context learning works
- Min et al. (2022) - Rethinking demonstration role
Why reviewers would accept: "Theoretical understanding of demonstration utility for GUI agents, with empirical validation."
Caveat: Requires theoretical ML expertise; risk of disconnect between theory and practice.
11.3 Additional Experiments Required¶
Beyond the technical contribution, main track requires substantially more empirical evidence:
Benchmark Coverage:

| Benchmark | Tasks Required | Current Status | Effort |
|---|---|---|---|
| Windows Agent Arena (WAA) | 50+ tasks | 8 tasks (incomplete) | 3-4 weeks |
| WebArena | 100+ tasks | 0 tasks | 4-6 weeks |
| OSWorld (optional) | 50+ tasks | 0 tasks | 4-6 weeks |

Evaluation Metrics:

- First-action accuracy: Already measured, but on non-standard tasks
- Episode success rate: Not measured - REQUIRED for main track
- Step efficiency: Actions per successful task
- Grounding accuracy: Correct element identification rate

Multi-Model Comparison:

| Model | Priority | Status |
|---|---|---|
| Claude Sonnet 4.5 | Required | Tested |
| GPT-4V | Required | Not tested |
| Gemini 1.5 Pro | Required | Not tested |
| Qwen-VL | Nice to have | Not tested |
| Open-source (LLaVA) | Nice to have | Not tested |

Ablation Studies:

1. Demo format: full trace vs behavior-only vs action-only
2. Number of demos: k = 1, 3, 5, 10
3. Demo relevance: exact match vs same-domain vs random
4. Demo recency: fresh demos vs stale demos
5. Model scale: Does the demo benefit scale with model size?

Statistical Requirements:

- 3+ seeds per experiment for variance estimation
- 95% confidence intervals on all metrics
- Statistical significance tests (McNemar's, permutation tests)
- Effect sizes (Cohen's h, odds ratios)
11.4 Timeline and Resources¶
Minimum timeline for main track submission:
| Phase | Duration | Activities |
|---|---|---|
| Phase 1: Technical contribution | 2-4 months | Implement learned retrieval or prompt synthesis |
| Phase 2: Large-scale evaluation | 2-3 months | WAA (50+), WebArena (100+), multi-model |
| Phase 3: Analysis & writing | 1-2 months | Ablations, significance tests, paper writing |
| Total | 6-9 months | From start to submission-ready |
Resource requirements:
| Resource | Estimate | Notes |
|---|---|---|
| Dedicated researchers | 1-2 FTE | Cannot be done part-time |
| GPU compute | $2-5k | For fine-tuning experiments (Option C) |
| API credits | $1-3k | Multi-model evaluation at scale |
| Azure VM (WAA) | $200-500 | Extended evaluation runs |
| Human annotation | $500-1k | Demo quality labels, retrieval training data |
Total estimated cost: $5-10k (excluding researcher time)
11.5 Honest Recommendation¶
For a small team with limited resources:

- Focus on the workshop paper. The workshop contribution is solid and achievable.
- Do NOT attempt main track unless you can dedicate 1-2 researchers full-time for 6+ months.
- A rejected main track submission wastes 6-9 months and demoralizes the team.

For a team with dedicated resources:

- Pursue Option A (Learned Retrieval) as the most tractable path to main track.
- This adds a clear learned component while building on existing infrastructure.
- Expected timeline: 6-7 months to submission-ready.
- Honest acceptance probability: 25-35% at NeurIPS/ICML (still challenging).

Do NOT attempt main track if:

- You cannot dedicate 1-2 researchers full-time to this project
- You do not have ML research expertise (vs engineering expertise)
- You need a publication in < 6 months
- You are not prepared for likely rejection and iteration
The workshop path is not a consolation prize. Top workshops at NeurIPS/ICML have excellent visibility, lead to valuable feedback, and establish priority for your ideas. Many impactful papers started as workshop papers.
11.6 Additional References for Main Track¶
Retrieval-Augmented Learning:

- Guu, K., Lee, K., Tung, Z., Pasupat, P., & Chang, M. W. (2020). REALM: Retrieval-augmented language model pre-training. ICML 2020.
- Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., ... & Grave, E. (2022). Atlas: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299.
- Zhou, S., Alon, U., Xu, F. F., Wang, Z., Jiang, Z., & Neubig, G. (2022). DocPrompting: Generating code by retrieving the docs. ICLR 2023.

Automatic Prompt Engineering:

- Zhou, Y., Muresanu, A. I., Han, Z., Paster, K., Pitis, S., Chan, H., & Ba, J. (2022). Large language models are human-level prompt engineers. ICLR 2023.
- Khattab, O., Santhanam, K., Li, X. L., Hall, D., Liang, P., Potts, C., & Zaharia, M. (2023). DSPy: Compiling declarative language model calls into self-improving pipelines. arXiv preprint arXiv:2310.03714.
- Fernando, C., Banarse, D., Michalewski, H., Osindero, S., & Rocktäschel, T. (2023). PromptBreeder: Self-referential self-improvement via prompt evolution. arXiv preprint arXiv:2309.16797.

GUI Agent Fine-Tuning:

- Hong, W., Wang, W., Lv, Q., Xu, J., Yu, W., Ji, J., ... & Tang, J. (2023). CogAgent: A visual language model for GUI agents. arXiv preprint arXiv:2312.08914.
- Cheng, K., Sun, Q., Chu, Y., Xu, F., Li, Y., Zhang, J., & Wu, Z. (2024). SeeClick: Harnessing GUI grounding for advanced visual GUI agents. arXiv preprint arXiv:2401.10935.
- Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Chen, X., Choromanski, K., ... & Zitkovich, B. (2023). RT-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818.
Appendix A: Honest Framing for Paper¶
Abstract Template¶
We present an empirical study of demonstration-conditioned prompting for vision-language model (VLM) GUI agents. While prior work has explored VLMs for GUI automation, we systematically evaluate the effect of including human demonstrations in the prompt. Across N tasks on the Windows Agent Arena benchmark, we find that demo-conditioning improves task success rate from X% to Y% (p < 0.01), representing a Z percentage point improvement. We analyze which task categories benefit most and identify limitations where demonstrations do not help. Our findings suggest that simple prompting interventions can substantially improve GUI agent performance without fine-tuning, and we release our code and demo library to facilitate future research.
Title Options (Honest)¶
- "Does Showing Help? An Empirical Study of Demo-Conditioned GUI Agents"
- "From Instructions to Demonstrations: Improving VLM GUI Agents Through Example"
- "Show, Don't Just Tell: The Value of Demonstrations for GUI Automation"
Contribution Statement Template¶
Our contributions are:

1. Empirical study: We conduct the first systematic evaluation of demo-conditioning for VLM GUI agents across N tasks and M models
2. Analysis: We identify which task categories and UI patterns benefit most from demonstrations
3. Practical method: We provide an open-source implementation with demo retrieval capabilities
4. Dataset: We release a library of K human demonstrations for GUI tasks
Appendix B: Cost Estimates¶
API Costs (Conservative)¶
| Model | Input ($/1M) | Output ($/1M) | Est. calls | Est. cost |
|---|---|---|---|---|
| Claude Sonnet 4.5 | $3 | $15 | 1000 | ~$50-100 |
| GPT-4V | $10 | $30 | 1000 | ~$100-200 |
| Gemini Pro Vision | $0.25 | $0.50 | 1000 | ~$10-20 |
| Total | - | - | 3000 | ~$200-400 |
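As a back-of-the-envelope check of the Claude row, assuming roughly 20k input tokens per call (screenshots are token-heavy) and 1k output tokens; both per-call figures are assumptions, not measurements.

```python
CALLS = 1000
INPUT_TOKENS, OUTPUT_TOKENS = 20_000, 1_000    # per call (assumed)
INPUT_RATE, OUTPUT_RATE = 3 / 1e6, 15 / 1e6    # $ per token (Claude Sonnet 4.5 row)

cost = CALLS * (INPUT_TOKENS * INPUT_RATE + OUTPUT_TOKENS * OUTPUT_RATE)
print(f"Estimated Claude cost: ${cost:.0f}")   # ~$75, within the ~$50-100 range above
```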
Compute Costs (Azure)¶
| Resource | Rate | Hours | Cost |
|---|---|---|---|
| D4ds_v5 (WAA VM) | $0.19/hr | 100 | ~$20 |
| Storage | $0.02/GB | 100GB | ~$2 |
| Total | - | - | ~$25 |
Appendix C: Reviewer Response Templates¶
"This is just few-shot prompting"¶
We agree that demo-conditioning can be viewed as a form of few-shot prompting. However, GUI automation presents unique challenges compared to standard NLP tasks: (1) visual grounding requires understanding spatial relationships in screenshots, (2) multi-step tasks require maintaining procedural context, and (3) UI variations across platforms and applications create distribution shift. Our contribution is demonstrating that demonstrations substantially help in this domain (X% -> Y%), characterizing when they help (task category analysis), and providing practical infrastructure (demo retrieval, open-source code) for practitioners.
"Sample size is too small"¶
We acknowledge this limitation. With n=N tasks and 3 trials each, we are powered to detect a 20pp effect at 80% power. Our observed effect of Zpp is well above this threshold, and our statistical tests (McNemar's, bootstrap CI) confirm significance. We have expanded our task set to N tasks for the camera-ready version.
"Results may not generalize beyond tested benchmarks"¶
This is a valid concern. We have focused on WAA as it represents realistic enterprise desktop tasks. In future work, we plan to evaluate on WebArena and OSWorld to assess cross-benchmark generalization. However, we note that the WAA benchmark itself covers diverse applications (browser, office, file management, settings) and our positive results across these categories suggest some generalizability within desktop environments.
Last updated: January 2026. This is a living document; update it as experiments complete and understanding deepens.