ServiceNow Unveils Massive Video Dataset for Computer-Use Agents

• ServiceNow introduces CUA-Suite, a dataset featuring 55 hours of expert video for training desktop-automation agents.
• VideoCUA captures nearly 10,000 human-performed tasks across 87 applications with continuous 30 fps screen recordings and cursor traces.
• Initial benchmarks show existing AI models struggle with desktop tasks, failing roughly 60% of the time.

Training AI to navigate a computer screen like a human has long been hindered by a lack of high-quality data. Most existing datasets rely on static screenshots, which fail to capture the fluid movement of a cursor or the subtle animations of a menu opening.

To bridge this gap, researchers at ServiceNow have released CUA-Suite, an expansive collection of over 6 million video frames dedicated to computer-use agents (CUAs). Unlike previous efforts, its video component, VideoCUA, provides continuous 30-frame-per-second recordings of experts performing nearly 10,000 tasks across 87 professional applications. Because the recordings preserve cursor traces and on-screen transitions, AI models can learn the temporal dynamics of human interaction rather than just the final result.
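To make the idea of a continuous recording concrete, here is a minimal sketch of how timestamped cursor samples could be lined up with 30 fps video frames. The trace layout (`(t_ms, x, y)` tuples) and the function names are assumptions for illustration only; the article does not describe the dataset's actual file format.

```python
from collections import defaultdict

FPS = 30  # frame rate stated in the article


def frame_index(t_ms: int, fps: int = FPS) -> int:
    """Index of the video frame on screen at t_ms milliseconds (frame 0 starts at t=0)."""
    # Integer arithmetic avoids floating-point rounding at frame boundaries.
    return (t_ms * fps) // 1000


def group_cursor_by_frame(trace, fps: int = FPS):
    """Group hypothetical (t_ms, x, y) cursor samples by the frame they fall in."""
    frames = defaultdict(list)
    for t_ms, x, y in trace:
        frames[frame_index(t_ms, fps)].append((x, y))
    return dict(frames)


def total_frames(hours: int, fps: int = FPS) -> int:
    """Number of frames in a recording of the given length."""
    return hours * 3600 * fps


if __name__ == "__main__":
    # Made-up cursor samples, not taken from the dataset.
    trace = [(0, 10, 10), (20, 12, 11), (40, 15, 13)]
    print(group_cursor_by_frame(trace))
    print(total_frames(55))
```

Under these assumptions, 55 hours at 30 fps comes to about 5.94 million frames, in line with the roughly 6 million frames reported; the published figures are rounded, so the match is only approximate.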

The suite also includes UI-Vision, a benchmark designed to test how well these agents can plan and execute actions in complex environments. Early results are sobering: even advanced foundation models failed roughly 60% of the time when faced with professional desktop applications. The release aims to spark new research into visual world models and generalist screen parsing (grounding), moving the field closer to truly autonomous digital assistants.
