State-of-the-Art Performance with 10x Less Data


Grounding Computer Use Agents on Human Demonstrations

Achieving SOTA desktop grounding with 700K samples vs 9M+ in prior work
Dense supervision • Expert annotations • Cross-platform generalization

60.3%
Desktop Avg
85.8%
Cross-Platform Avg
50.6%
OSWorld Agent
3.56M+
Human Annotations
56K
Screenshots
87
Applications
10x
Data Efficient

Outstanding Performance Across All Benchmarks

GroundNext achieves best-in-class results with 10x less training data

Desktop Grounding Benchmarks

Trained exclusively on desktop data, yet achieves SOTA across all desktop benchmarks

Model                 ScreenSpot-Pro   OSWorld-G   UI-Vision   Desktop Avg
Qwen2.5-VL-7B         29.7             42.7        16.5        29.6
UI-TARS-72B           38.1             57.1        25.5        40.2
GroundNext-3B         49.8             64.2        62.1        58.7 (+46.0%)
GroundNext-7B         52.9             67.7        60.3        60.3 (+50.0%)

Percentages show relative improvement over the strongest baseline, UI-TARS-72B.

Cross-Platform Generalization

Strong zero-shot transfer to mobile and web despite desktop-only training

Model                 MMBench-GUI   ScreenSpot-v2   Cross-Platform Avg
Qwen2.5-VL-7B         33.9          88.8            61.4
UI-TARS-72B           74.3          90.3            82.3
GroundNext-3B         77.1          88.5            82.8 (+0.6%)
GroundNext-7B         81.1          90.4            85.8 (+4.3%)

Percentages show relative improvement over the strongest baseline, UI-TARS-72B.

Agentic Performance on OSWorld

GroundNext-3B, paired with o3 as the planner, is competitive with far larger specialized models

Model                 OS     Office   Daily   Pro    Workflow   Overall
OpenAI o3             62.5   14.5     21.4    38.8   16.5       23.0
CUA                   23.9   34.6     55.1    18.3   18.3       31.4
OpenCUA-7B            41.7   22.5     35.4    46.3   9.8        26.5
OpenCUA-72B           58.3   47.0     53.8    73.5   20.4       46.1
UI-TARS-1.5-7B        33.3   29.9     37.9    53.1   9.1        29.6
JEDI-7B w/ o3         50.0   46.1     61.9    75.5   35.3       51.0
GroundNext-3B w/ o3   62.5   47.0     55.0    73.5   36.5       50.6

Task categories: OS (operating system), Office (productivity apps), Daily (common tasks), Pro (professional software), Workflow (tasks spanning multiple applications)
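In the rows marked "w/ o3", o3 acts as the planner and the grounding model resolves each referenced UI element to on-screen coordinates. The sketch below shows the general shape of such a planner/grounder loop; `call_o3`, `call_groundnext`, and the `env` interface are hypothetical stand-ins for illustration, not APIs from this release.

```python
# Minimal sketch of a planner + grounder agent loop (hypothetical API).
# `call_o3` and `call_groundnext` stand in for the actual model clients;
# the real GroundNext / o3 interfaces are not specified on this page.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # e.g. "click", "type", "done"
    target: str = ""   # natural-language element description from the planner
    text: str = ""     # text to type, if any

def run_episode(task: str, env, call_o3, call_groundnext, max_steps: int = 30):
    history = []
    for _ in range(max_steps):
        screenshot = env.screenshot()
        # 1) Planner (o3) proposes the next high-level action from the task,
        #    the current screenshot, and the action history.
        action: Action = call_o3(task, screenshot, history)
        if action.kind == "done":
            break
        # 2) Grounder (GroundNext) maps the element description to (x, y)
        #    pixel coordinates on the current screenshot.
        if action.kind == "click":
            x, y = call_groundnext(screenshot, action.target)
            env.click(x, y)
        elif action.kind == "type":
            env.type_text(action.text)
        history.append(action)
    return history
```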

Key Achievements

10x Data Efficiency

SOTA with 700K samples vs 9M+ in prior work

Cross-Domain Excellence

Desktop training generalizes to mobile & web

Fine-Grained Grounding

Superior on small UI elements and complex workflows

Why GroundCUA?

High-Quality Desktop Dataset

  • Dense, expert-annotated supervision
  • Coverage of almost every visible element
  • Fine-grained categories for 50% of elements

Efficient Model Training

  • SOTA with 700K vs 9M+ datapoints
  • Two-stage: SFT + Reinforcement Learning
  • Models at 3B and 7B scales

Cross-Platform Generalization

  • Desktop, mobile, and web environments
  • Evaluation on five challenging benchmarks
  • Robust despite desktop-only training

GroundCUA Dataset

The largest and most densely annotated human-verified dataset for desktop grounding

Dataset Distribution

Distribution across 87 applications
56K
SCREENSHOTS

Densely labeled keyframes from task demonstrations

3.56M
ANNOTATIONS

Human-verified bounding boxes with textual labels

87
APPLICATIONS

Across 12 categories, including office, creative, development, and scientific

64
AVG ELEMENTS/IMAGE

Densely labeled, with up to 542 elements in a single image

0.4-7MP
RESOLUTION

High-quality images with clear visibility

10K
TASKS

Diverse human-executed computer use tasks
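To make the annotation structure concrete, here is a minimal sketch of what one GroundCUA record could look like and how it might be flattened into grounding training pairs. The field names (`bbox`, `label`, `category`) are assumptions for illustration; consult the released dataset for the actual schema.

```python
# Hypothetical shape of one GroundCUA record; the released schema may differ.
# Each screenshot carries dozens of human-verified element annotations
# (64 on average, up to 542), each with a bounding box and a text label.
from typing import TypedDict

class Element(TypedDict):
    bbox: tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
    label: str                               # e.g. "Bold button"
    category: str                            # fine-grained type, when present

class Record(TypedDict):
    image_path: str        # screenshot keyframe from a task demonstration
    application: str       # one of the 87 covered applications
    elements: list[Element]

def to_grounding_pairs(record: Record):
    """Flatten a record into (image, instruction, target-point) training pairs."""
    for el in record["elements"]:
        x1, y1, x2, y2 = el["bbox"]
        center = ((x1 + x2) / 2, (y1 + y2) / 2)
        yield record["image_path"], el["label"], center
```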

Access the Dataset

Download GroundCUA and start building better computer-use agents

GroundNext Models

State-of-the-art vision-language models at 3B and 7B scales

Two-Stage Training Pipeline

Qwen2.5-VL base (3B & 7B)  →  SFT on 700K GroundCUA samples  →  RL on 10K samples with RLOO  →  GroundNext (SOTA performance)
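The RL stage uses RLOO (REINFORCE Leave-One-Out): for each prompt, k responses are sampled, and each response's baseline is the mean reward of the other k-1 samples. A minimal sketch of the advantage computation follows; the binary click-in-box reward described in the docstring is our assumption of a plausible grounding reward, not necessarily the paper's exact choice.

```python
import torch

def rloo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """RLOO advantages for k sampled responses per prompt.

    rewards: (batch, k) scalar rewards, e.g. 1.0 if the predicted click
    lands inside the target element's bounding box, else 0.0 (an assumed
    grounding reward; the paper's exact reward may differ).
    """
    k = rewards.size(1)
    # Leave-one-out baseline: mean reward of the other k-1 samples.
    baseline = (rewards.sum(dim=1, keepdim=True) - rewards) / (k - 1)
    return rewards - baseline

# Example: 2 prompts, k=4 sampled responses each.
r = torch.tensor([[1.0, 0.0, 1.0, 0.0],
                  [1.0, 1.0, 1.0, 0.0]])
adv = rloo_advantages(r)  # positive for above-baseline samples
```

Because the baseline excludes the sample it scores, the advantage estimate stays unbiased without training a separate value model, which keeps the RL stage lightweight at the 10K-sample scale.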

Download the Models

Access GroundNext-3B and GroundNext-7B for your research
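Since GroundNext is fine-tuned from Qwen2.5-VL, it should load with the standard Qwen2.5-VL classes in transformers. A hedged usage sketch follows: the model ID is a placeholder (use the actual ID from the release page), and depending on your transformers version, preprocessing may instead follow the qwen_vl_utils pattern from the Qwen2.5-VL documentation.

```python
# Hedged usage sketch, assuming GroundNext keeps the Qwen2.5-VL interface.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from PIL import Image

model_id = "ORG/GroundNext-7B"  # placeholder, not the real repo ID
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("screenshot.png")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Click the 'Save' button."},
    ],
}]
# Recent transformers versions can tokenize multimodal chats directly;
# older versions may require qwen_vl_utils.process_vision_info instead.
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(
    out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0])
```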

Sample Applications

Examples from our dataset covering diverse desktop applications


Authors

A collaboration across leading AI research institutions

Aarash Feizi (1,2,4)*
Shravan Nayak (1,3)*
Xiangru Jian (5)
Kevin Qinghong Lin (6)
Kaixin Li (6)
Rabiul Awal (1,3,4)
Xing Han Lù (1,2)
Johan Obando-Ceron (1,3)
Juan A. Rodriguez (1,8)
Nicolas Chapados (4)
David Vazquez (4)
Adriana Romero-Soriano (1,2)
Reihaneh Rabbany (1,2)
Perouz Taslakian (4)
Christopher Pal (4)
Spandana Gella (4)
Sai Rajeswar (4,1,3)
*Equal contribution

Affiliations

1 Mila - Quebec AI Institute
2 McGill University
3 Université de Montréal
4 ServiceNow Research
5 University of Waterloo
6 National University of Singapore
8 École de Technologie Supérieure

Citation

@misc{feizi2025groundingcomputeruseagents,
      title={Grounding Computer Use Agents on Human Demonstrations}, 
      author={Aarash Feizi and Shravan Nayak and Xiangru Jian and Kevin Qinghong Lin and Kaixin Li and Rabiul Awal and Xing Han Lù and Johan Obando-Ceron and Juan A. Rodriguez and Nicolas Chapados and David Vazquez and Adriana Romero-Soriano and Reihaneh Rabbany and Perouz Taslakian and Christopher Pal and Spandana Gella and Sai Rajeswar},
      year={2025},
      eprint={2511.07332},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2511.07332}
}