Claude Opus 4.6 Leads Agentic Knowledge Work Benchmarks
- Claude Opus 4.6 reaches 1606 Elo, surpassing GPT-5.2 by nearly 150 points.
- The model uses adaptive thinking to complete complex workflows through shell and web access.
- Increased token consumption makes Opus 4.6 the most expensive model for agentic tasks.
Anthropic's latest model, Claude Opus 4.6, has claimed the top spot on the GDPval-AA leaderboard, which ranks AI models on a benchmark of complex, multi-step real-world tasks. Developed by Artificial Analysis, the benchmark goes beyond simple text generation: it tests whether models can execute agentic workflows, such as data analysis and video production scheduling, within an interactive loop using shell access and web browsing.
The performance leap is driven by an adaptive thinking mode that allows the model to iterate on and refine its work more aggressively than its predecessor. Opus 4.6 uses 30% to 60% more tokens than version 4.5, and this extra computational effort translates into an Elo advantage of nearly 150 points over OpenAI's flagship GPT-5.2. The model also makes heavy use of its image viewer tool to visually inspect its own outputs, checking them for professional-grade aesthetics and structural accuracy.
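To put a gap of that size in perspective, the sketch below converts an Elo difference into an expected head-to-head score. It assumes the conventional logistic Elo formula with a 400-point scale; the article does not say how Artificial Analysis computes its ratings, so the function name and the formula choice are illustrative assumptions rather than the leaderboard's actual method.

```python
def elo_expected_score(rating_gap: float) -> float:
    """Expected score of the higher-rated side under the standard
    logistic Elo model (assumed here, not confirmed by the source)."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))


# Approximate lead over GPT-5.2 reported in the article.
gap = 150
expected = elo_expected_score(gap)
print(f"Expected score at a {gap}-point gap: {expected:.2f}")  # prints 0.70
```

Under that assumption, a lead of roughly 150 points corresponds to the higher-rated model being preferred in about seven out of ten pairwise comparisons.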
Despite its technical dominance, the model's cost remains a point of contention for enterprise users. Pricing is held at $5 per million input tokens and $25 per million output tokens, and the combination of premium rates and higher token consumption makes Opus 4.6 the most expensive model for agentic tasks. However, for organizations that need the highest capability currently available, its position on the Pareto frontier suggests that the performance gains may justify the financial premium for high-stakes knowledge work.
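The cost effect of higher token use at unchanged prices can be worked through with simple arithmetic. The sketch below uses the per-token prices quoted above; the baseline token counts are hypothetical, and it assumes the 30% to 60% increase applies equally to input and output tokens, which the article does not specify.

```python
# Prices quoted in the article, in USD per million tokens.
INPUT_PRICE_PER_M = 5.00
OUTPUT_PRICE_PER_M = 25.00


def task_cost(input_tokens: float, output_tokens: float) -> float:
    """Cost in USD of one task at the quoted per-token prices."""
    return (input_tokens / 1e6) * INPUT_PRICE_PER_M + (
        output_tokens / 1e6
    ) * OUTPUT_PRICE_PER_M


# Hypothetical baseline token counts for a single agentic task.
baseline_input, baseline_output = 200_000, 40_000

# Assumption: the increase scales input and output tokens uniformly.
for extra in (0.0, 0.30, 0.60):
    scale = 1.0 + extra
    cost = task_cost(baseline_input * scale, baseline_output * scale)
    print(f"{extra:.0%} more tokens: ${cost:.2f} per task")
```

Because the prices are unchanged, the per-task cost under these assumptions rises in direct proportion to token use, so a 30% to 60% increase in tokens means a 30% to 60% increase in spend.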