Achieving SOTA desktop grounding with 700K samples vs 9M+ in prior work
Dense supervision • Expert annotations • Cross-platform generalization
GroundNext achieves best-in-class results with more than 10x less training data
Trained exclusively on desktop data, yet achieves SOTA across all desktop benchmarks
| Model | ScreenSpot-Pro | OSWorld-G | UI-Vision | Desktop Avg |
|---|---|---|---|---|
| Qwen2.5-VL-7B | 29.7 | 42.7 | 16.5 | 29.6 |
| UI-TARS-72B | 38.1 | 57.1 | 25.5 | 40.2 |
| GroundNext-3B | 49.8 | 64.2 | 62.1 | 58.7 (+46.0%) |
| GroundNext-7B | 52.9 | 67.7 | 60.3 | 60.3 (+50.0%) |
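The Desktop Avg column and the percentages attached to the GroundNext rows can be reproduced from the table itself. The sketch below assumes the average is an unweighted mean of the three benchmark scores and that the percentage is the relative gain over the UI-TARS-72B average; the page does not state either, but both assumptions match the printed numbers. The function names are illustrative only.

```python
def mean_score(scores):
    """Unweighted mean of per-benchmark scores, rounded to one decimal."""
    return round(sum(scores) / len(scores), 1)


def relative_gain_pct(score, baseline):
    """Relative improvement of `score` over `baseline`, in percent."""
    return round((score / baseline - 1.0) * 100.0, 1)


# Scores copied from the table: ScreenSpot-Pro, OSWorld-G, UI-Vision.
rows = {
    "Qwen2.5-VL-7B": [29.7, 42.7, 16.5],
    "UI-TARS-72B": [38.1, 57.1, 25.5],
    "GroundNext-3B": [49.8, 64.2, 62.1],
    "GroundNext-7B": [52.9, 67.7, 60.3],
}

desktop_avg = {name: mean_score(scores) for name, scores in rows.items()}

# Assumption: UI-TARS-72B is the reference row for the percentage.
baseline = desktop_avg["UI-TARS-72B"]
gains = {
    name: relative_gain_pct(desktop_avg[name], baseline)
    for name in ("GroundNext-3B", "GroundNext-7B")
}

for name, gain in gains.items():
    print(f"{name}: avg {desktop_avg[name]}, gain {gain}% over UI-TARS-72B")
```

The same two functions apply to the cross-platform table if the two benchmark scores per row are substituted.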
Strong zero-shot transfer to mobile and web despite desktop-only training
| Model | MMBench-GUI | ScreenSpot-v2 | Cross-Platform Avg |
|---|---|---|---|
| Qwen2.5-VL-7B | 33.9 | 88.8 | 61.4 |
| UI-TARS-72B | 74.3 | 90.3 | 82.3 |
| GroundNext-3B | 77.1 | 88.5 | 82.8 (+0.6%) |
| GroundNext-7B | 81.1 | 90.4 | 85.8 (+4.3%) |
Combined with o3, GroundNext-3B performs competitively with larger specialized models
| Model | OS | Office | Daily | Pro | Workflow | Overall |
|---|---|---|---|---|---|---|
| OpenAI o3 | 62.5 | 14.5 | 21.4 | 38.8 | 16.5 | 23.0 |
| CUA | 23.9 | 34.6 | 55.1 | 18.3 | 18.3 | 31.4 |
| OpenCUA-7B | 41.7 | 22.5 | 35.4 | 46.3 | 9.8 | 26.5 |
| OpenCUA-72B | 58.3 | 47.0 | 53.8 | 73.5 | 20.4 | 46.1 |
| UI-TARS-1.5-7B | 33.3 | 29.9 | 37.9 | 53.1 | 9.1 | 29.6 |
| JEDI-7B w/ o3 | 50.0 | 46.1 | 61.9 | 75.5 | 35.3 | 51.0 |
| GroundNext-3B w/ o3 | 62.5 | 47.0 | 55.0 | 73.5 | 36.5 | 50.6 |
Task categories: OS (operating system), Office (productivity apps), Daily (common tasks), Pro (professional software), Workflow (multi-app tasks)
SOTA with 700K samples vs 9M+ in prior work
Desktop training generalizes to mobile & web
Superior on small UI elements and complex workflows
The largest and most densely annotated human-verified dataset for desktop grounding
Distribution across 87 applications
Densely labeled keyframes from task demonstrations
Human-verified bounding boxes with textual labels
Across 12 categories: office, creative, dev, scientific
Maximum density with up to 542 elements
High-quality images with clear visibility
Diverse human-executed computer use tasks
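To make "bounding boxes with textual labels" concrete, here is a minimal sketch of what one annotation record and a grounding check could look like. The field names and the `is_hit` helper are hypothetical and do not reflect the released GroundCUA schema; the point-inside-box criterion is a common way to score grounding predictions and is assumed here rather than taken from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementAnnotation:
    """One labeled UI element (hypothetical schema, pixel coordinates)."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def contains(self, x, y):
        """True if the point (x, y) lies inside or on the box boundary."""
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max


def is_hit(pred_xy, target):
    """Score a predicted click point against the target element's box."""
    x, y = pred_xy
    return target.contains(x, y)


# Illustrative element; the values are made up for the example.
save_button = ElementAnnotation("Save button", 100, 40, 140, 64)
print(is_hit((110, 52), save_button))
print(is_hit((300, 52), save_button))
```

With densely annotated screenshots, each image would carry a list of such records, one per visible element.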
Download GroundCUA and start building better computer-use agents
Training pipeline: Qwen2.5-VL (3B & 7B) → 700K samples (GroundCUA) → 10K samples (RLOO) → SOTA performance
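The pipeline names RLOO (REINFORCE Leave-One-Out) without describing it. As a rough sketch of the general technique, not of the authors' training code: for a group of responses sampled for the same prompt, each response's advantage is its reward minus the mean reward of the other responses in the group. The reward values below are hypothetical.

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages for k sampled responses to one prompt.

    advantage_i = reward_i - mean(reward_j for j != i)
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


# Hypothetical binary rewards, e.g. 1.0 when a predicted click is correct.
example = rloo_advantages([1.0, 0.0, 0.0, 1.0])
print(example)
```

Because the baseline for each sample excludes that sample's own reward, the advantages within a group sum to zero and no learned value function is required.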
Access GroundNext-3B and GroundNext-7B for your research
Examples from our dataset covering diverse desktop applications
A collaboration across leading AI research institutions
@misc{feizi2025groundingcomputeruseagents,
title={Grounding Computer Use Agents on Human Demonstrations},
author={Aarash Feizi and Shravan Nayak and Xiangru Jian and Kevin Qinghong Lin and Kaixin Li and Rabiul Awal and Xing Han Lù and Johan Obando-Ceron and Juan A. Rodriguez and Nicolas Chapados and David Vazquez and Adriana Romero-Soriano and Reihaneh Rabbany and Perouz Taslakian and Christopher Pal and Spandana Gella and Sai Rajeswar},
year={2025},
eprint={2511.07332},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2511.07332}
}