METR Updates AI Time Horizon Estimates for Autonomous Capabilities
- METR updates its "Time Horizon" benchmark, expanding the suite to 228 tasks to measure autonomous capability growth
- Latest data shows AI autonomous capabilities doubling roughly every seven months, with the doubling time for recent frontier models compressing to about 89 days
- Evaluation infrastructure transitions to the UK AI Security Institute's open-source Inspect framework for standardized testing
The research nonprofit METR has released Time Horizon 1.1, a significant update to its methodology for measuring how long AI models can operate autonomously without human intervention (their "time horizons"). By expanding its task suite from 170 to 228 tasks and incorporating more "long-haul" tasks—those requiring eight or more human-equivalent hours—METR aims to capture the rapid ascent of frontier models. In effect, the methodology measures how much work an AI agent can complete before it fails or requires human correction.
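METR's headline number is commonly summarized as the task length at which a model's success rate crosses 50%. The sketch below illustrates that idea with simple linear interpolation in log space (METR's actual methodology fits a statistical model rather than interpolating, and the data points here are invented for illustration):

```python
from math import log2

def fifty_percent_horizon(results: list[tuple[float, float]]) -> float:
    """Interpolate the task length (minutes) at which success crosses 50%.

    results: (task_length_minutes, success_rate) pairs sorted by length,
    with success rates decreasing as tasks get longer.
    """
    for (len_a, p_a), (len_b, p_b) in zip(results, results[1:]):
        if p_a >= 0.5 >= p_b:
            # Interpolate in log2(task length), matching the convention
            # of plotting time horizons on a log scale.
            frac = (p_a - 0.5) / (p_a - p_b)
            x = log2(len_a) + frac * (log2(len_b) - log2(len_a))
            return 2 ** x
    raise ValueError("success rate never crosses 50% in the given range")

# Invented example: success falls from 90% on 15-minute tasks
# to 30% on 4-hour tasks.
horizon = fifty_percent_horizon([(15, 0.9), (60, 0.6), (240, 0.3)])
```

The log-space interpolation matters: task lengths in such suites span minutes to many hours, so the 50% crossing point is best located on a logarithmic axis.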
The findings reveal a relentless exponential trend. While the long-term doubling rate for autonomous capabilities sits at roughly seven months, recent data suggests this pace may be accelerating. In fact, since early 2024, the doubling time for top-tier models has compressed to approximately 89 days under the new TH1.1 metrics. This rapid growth underscores the shift from AI as a simple chatbot to an agentic tool capable of managing multi-step workflows across hours of execution.
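To see what these doubling rates imply, here is a small sketch projecting a time horizon forward under a constant doubling time (the seven-month and 89-day figures come from the article; the one-hour starting horizon is an arbitrary illustration):

```python
def projected_horizon(start_hours: float, days_elapsed: float,
                      doubling_time_days: float) -> float:
    """Project a time horizon forward assuming constant exponential growth."""
    return start_hours * 2 ** (days_elapsed / doubling_time_days)

# One year of growth at the two rates reported by METR:
# the long-run ~7-month doubling vs. the recent ~89-day doubling.
slow = projected_horizon(1.0, 365, 7 * 30.4)   # roughly 3.3x in a year
fast = projected_horizon(1.0, 365, 89)         # roughly 17x in a year
```

The gap compounds quickly: at a seven-month doubling time a one-hour horizon grows a few-fold per year, while an 89-day doubling time multiplies it by an order of magnitude over the same period.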
The transition from METR’s internal "Vivaria" platform to the UK AI Security Institute’s open-source "Inspect" framework marks a move toward standardized safety testing. While some models like GPT-4o showed slight performance shifts between platforms, the overall trend remains consistent. As models like GPT-5 and Claude Opus 4.5 push the ceiling of current benchmarks, METR is prioritizing the development of even more complex, multi-day tasks to ensure evaluations remain relevant for the next generation of systems.