New Benchmark Reveals Enterprise AI Agents Struggle with Planning

• EnterpriseOps-Gym evaluates AI agents across 1,150 tasks in HR, IT, and Customer Service
• Top models achieve only 37.4% success, highlighting major gaps in long-horizon strategic planning
• Providing human-curated strategic plans increases agent performance by up to 35 percentage points

The transition from AI as a simple chatbot to an autonomous worker is proving more difficult than anticipated. Researchers from ServiceNow-AI have introduced EnterpriseOps-Gym, a rigorous benchmark designed to simulate the messy, stateful reality of corporate environments. Unlike static tests, this sandbox features over 500 functional tools and hundreds of database tables, forcing models to manage complex, multi-step workflows across departments like HR and IT.

The results are a sobering reality check for the industry. Even the most advanced models struggled, with the top performer reaching a success rate of just 37.4%. The primary bottleneck is not a lack of information but a failure of strategic reasoning: the ability to map out a long sequence of actions that reaches a goal. When researchers provided the agents with "oracle" plans (human-designed step-by-step guides), success rates rose by 14 to 35 percentage points, suggesting that current models lack the foresight required for professional autonomy.

Perhaps more concerning is the agents' inability to say "no." In the study, models frequently attempted tasks they lacked the authority or data to complete, leading to unintended and potentially harmful side effects within the simulated enterprise. This highlights a critical safety gap: if an agent cannot recognize its own limitations or follow strict access protocols, it remains too risky for deployment in sensitive business operations.
