Claude Sonnet 4.6 Claims Top Spot on GDPval-AA Benchmark

• Claude Sonnet 4.6 overtakes Opus 4.6 as the top-performing model on GDPval-AA.
• The new model achieves an Elo score of 1633, using 280 million tokens with adaptive thinking enabled.
• Sonnet 4.6 demonstrates an 85% win rate improvement over its predecessor in agentic loop testing.

Artificial Analysis has crowned Claude Sonnet 4.6 as the new leader in its GDPval-AA benchmark, a rigorous evaluation of how models handle complex, real-world knowledge work. In a surprising turn, this mid-tier model slightly outperformed Anthropic’s flagship Opus 4.6, securing an Elo rating of 1633. This performance was achieved using the newly introduced adaptive thinking mode, which allows the model to allocate more computational effort to difficult problems.
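For readers unfamiliar with Elo ratings, the sketch below shows how a rating gap translates into an expected head-to-head win rate. It assumes the conventional logistic Elo formula on a 400-point scale; the article does not say exactly how Artificial Analysis fits its ratings, and the rival rating of 1600 is a hypothetical value chosen only for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected win rate of A against B under the conventional Elo formula.

    Assumption: the standard 400-point logistic scale. The benchmark's
    actual rating procedure may differ.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


SONNET_46_ELO = 1633         # rating reported in the article
HYPOTHETICAL_RIVAL = 1600    # illustrative value only, not a reported rating

win_rate = expected_score(SONNET_46_ELO, HYPOTHETICAL_RIVAL)
print(f"Expected win rate vs. a 1600-rated model: {win_rate:.1%}")
```

Under this formula a small rating gap corresponds to a win rate only slightly above 50%, which is consistent with describing a narrow lead as "slightly outperformed".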

The leap in performance comes with a significant increase in resource consumption. Sonnet 4.6 processed 280 million tokens to complete the benchmark, a nearly fivefold increase over the 58 million tokens used by its predecessor, Sonnet 4.5. Interestingly, while Sonnet 4.6 is now the highest-ranking model, it proved less efficient than Opus 4.6, which completed similar tasks using roughly 40% fewer tokens. This trade-off suggests that while Sonnet can reach elite performance levels, it does so by "thinking" much longer and more expensively.
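The figures in this paragraph can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the Opus 4.6 total is not reported directly, so the value computed here is merely what "roughly 40% fewer" would imply, and the helper names are illustrative.

```python
# Token totals reported in the article, in millions.
SONNET_46_TOKENS_M = 280
SONNET_45_TOKENS_M = 58


def growth_factor(new: float, old: float) -> float:
    """Return how many times larger `new` is than `old`."""
    return new / old


def tokens_after_reduction(baseline: float, fraction_fewer: float) -> float:
    """Return the count that is `fraction_fewer` below `baseline`."""
    return baseline * (1.0 - fraction_fewer)


ratio = growth_factor(SONNET_46_TOKENS_M, SONNET_45_TOKENS_M)
# Assumption: "roughly 40% fewer" is treated as exactly 40% for this estimate.
opus_estimate_m = tokens_after_reduction(SONNET_46_TOKENS_M, 0.40)

print(f"Sonnet 4.6 vs. Sonnet 4.5: {ratio:.2f}x the tokens")
print(f"Implied Opus 4.6 usage: about {opus_estimate_m:.0f} million tokens")
```

The ratio comes out just under five, matching the "nearly fivefold" description, and the implied Opus 4.6 figure is an estimate rather than a published number.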

The GDPval-AA metric focuses on agentic performance, where models operate in a continuous loop to solve multi-step problems such as data analysis or video editing. Using shell access and web browsing through an open-source harness called Stirrup, these models move beyond simple chat interfaces toward autonomous problem-solving. The underlying dataset, originally curated by OpenAI, spans 44 different occupations, so the results are meant to reflect the kinds of high-stakes tasks models face in professional environments today.
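To make the idea of an agentic loop concrete, here is a minimal sketch of the pattern: a policy picks a tool, the harness runs it, and the observation is fed back until the policy finishes. This is not Stirrup's API. Every name here (`Action`, `run_agent`, `fake_shell`, `scripted_policy`) is hypothetical, the "model" is a scripted stand-in, and the shell tool returns canned output instead of executing anything.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Action:
    tool: str      # name of the tool to call, or "finish"
    argument: str  # tool input, or the final answer when tool == "finish"


@dataclass
class AgentRun:
    transcript: List[str] = field(default_factory=list)
    answer: Optional[str] = None


def run_agent(
    task: str,
    policy: Callable[[List[str]], Action],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 10,
) -> AgentRun:
    """Loop: ask the policy for an action, run the tool, record the result."""
    run = AgentRun(transcript=[f"task: {task}"])
    for _ in range(max_steps):
        action = policy(run.transcript)
        if action.tool == "finish":
            run.answer = action.argument
            break
        handler = tools.get(action.tool)
        observation = (
            handler(action.argument) if handler else f"unknown tool: {action.tool}"
        )
        run.transcript.append(f"{action.tool}({action.argument!r}) -> {observation}")
    return run


def fake_shell(command: str) -> str:
    """Hypothetical stand-in for shell access; returns canned output."""
    return "rows=3" if command == "wc -l data.csv" else ""


def scripted_policy(transcript: List[str]) -> Action:
    """Hypothetical stand-in for a model: one tool call, then finish."""
    if len(transcript) == 1:
        return Action("shell", "wc -l data.csv")
    return Action("finish", f"Observed {transcript[-1].split(' -> ')[1]}")


result = run_agent("Count rows in data.csv", scripted_policy, {"shell": fake_shell})
print(result.answer)
```

The `max_steps` cap is the design choice worth noting: a real harness needs some bound on the loop, since each extra step consumes tokens.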
