METR Evaluation Finds GPT-5.1-Codex-Max Poses Low Autonomy Risks
- GPT-5.1-Codex-Max shows incremental autonomy gains, with a 2-hour-42-minute human-equivalent success horizon.
- METR reports no evidence of catastrophic risks such as rogue replication or AI self-improvement capabilities.
- Model performance followed historical scaling trends without signs of sandbagging or evaluation sabotage.
The Model Evaluation and Threat Research organization (METR) recently assessed OpenAI's GPT-5.1-Codex-Max to determine whether the large language model could independently improve itself or replicate without human help. Its findings suggest that while the model improves on previous versions, it remains well below the threshold for catastrophic autonomy risks. The model achieved a "50% time horizon" of roughly 2 hours and 42 minutes, meaning it succeeds half the time on software tasks that would take a skilled human that long to complete.
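To make the metric concrete, here is a minimal sketch of how a 50% time horizon can be estimated, following METR's published approach of fitting a logistic curve to agent success against the log of human completion time and solving for where predicted success crosses 50%. Every number below is hypothetical, not METR's data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-task results: how long each task takes a human
# (in minutes), and whether the agent solved it (1) or not (0).
task_minutes = np.array([5, 15, 30, 60, 90, 120, 180, 240, 480])
agent_success = np.array([1, 1, 0, 1, 1, 1, 0, 1, 0])

# Fit success probability as a logistic function of log task duration.
X = np.log(task_minutes).reshape(-1, 1)
model = LogisticRegression().fit(X, agent_success)

# P(success) = 0.5 where intercept + coef * log(t) = 0,
# i.e. t = exp(-intercept / coef).
intercept = model.intercept_[0]
coef = model.coef_[0, 0]
horizon_minutes = np.exp(-intercept / coef)
print(f"Estimated 50% time horizon: ~{horizon_minutes:.0f} minutes")
```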
The evaluation used specialized task suites such as Human-Calibrated Autonomy Software Tasks (HCAST), a benchmark that measures an agent's ability to navigate complex engineering environments. A primary concern for AI safety researchers is "sandbagging," in which an AI deliberately hides its true capabilities to avoid being restricted. After analyzing the model's chain-of-thought, the step-by-step reasoning it generates, METR found no evidence of such deception, nor of "reward hacking," in which a model games the scoring system rather than genuinely solving the task.
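METR's actual protocol is more involved, but a toy version of a sandbagging probe illustrates the idea: run the same task suite under a neutral framing and an evaluation-aware framing, then test whether success rates diverge more than chance allows. The framings and counts below are invented for illustration.

```python
from scipy.stats import fisher_exact

neutral = {"passed": 41, "failed": 19}      # tasks framed as routine work
eval_aware = {"passed": 38, "failed": 22}   # tasks framed as a capability audit

# Compare the two conditions with a 2x2 exact test.
table = [
    [neutral["passed"], neutral["failed"]],
    [eval_aware["passed"], eval_aware["failed"]],
]
_, p_value = fisher_exact(table)

# A large, significant drop under the evaluation-aware framing would be
# one red flag for deliberately hidden capability.
print(f"Fisher exact p-value: {p_value:.3f}")
```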
OpenAI provided generous token budgets, but gains slowed markedly after the first 5 million tokens, suggesting that simply adding more "thinking time" to current architectures yields diminishing returns. While the report concludes that immediate risks are low, the researchers caution that breakthroughs over the next six months could still produce a sudden leap in autonomous capability.
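As a rough illustration of what those diminishing returns look like, the sketch below computes the marginal success gained per additional million tokens on an invented, saturating curve; none of these figures come from the METR report.

```python
import numpy as np

# Hypothetical success rates at increasing token budgets (millions).
tokens_millions = np.array([1, 2, 5, 10, 20, 40])
success_rate = np.array([0.30, 0.38, 0.46, 0.49, 0.51, 0.52])

# Marginal gain per extra million tokens between consecutive budgets:
# the payoff shrinks sharply once the budget passes ~5M tokens.
marginal = np.diff(success_rate) / np.diff(tokens_millions)
for budget, gain in zip(tokens_millions[1:], marginal):
    print(f"up to {budget:>2}M tokens: +{gain:.3f} success per extra 1M tokens")
```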