Claude Opus 4.5 Tops SWE-bench February 2026 Leaderboard
- Anthropic’s Claude Opus 4.5 takes the top spot on the February 2026 SWE-bench Verified leaderboard.
- Chinese models like MiniMax M2.5 and GLM-5 perform strongly, rivaling established models from Google and OpenAI.
- OpenAI’s GPT-5.2 ranks sixth overall, while the specialized GPT-5.3-Codex model remains unranked.
The software engineering benchmark SWE-bench recently updated its leaderboard, offering a rare look at model performance without the potential bias of self-reported lab data. This specific run focused on the "Bash Only" track using SWE-bench Verified, a manually curated set of 500 real-world coding challenges pulled from popular open-source repositories. The results reveal a shifting landscape where Western industry giants face stiff competition from emerging global laboratories.
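For readers who want to inspect the data themselves, the Verified split is published on Hugging Face. The sketch below loads it with the datasets library; the dataset name and field names match the public princeton-nlp/SWE-bench_Verified release as of this writing, but are worth double-checking against the current schema.

```python
# A minimal sketch, assuming the public princeton-nlp/SWE-bench_Verified
# dataset on Hugging Face; field names reflect that release and may change.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
print(len(ds))  # 500 manually curated instances

task = ds[0]
print(task["repo"])                     # source repository, e.g. a project like django/django
print(task["instance_id"])              # unique identifier for the issue
print(task["problem_statement"][:300])  # the GitHub issue text the model works from
print(task["FAIL_TO_PASS"])             # tests a correct patch must flip from failing to passing
```

A model counts as solving an instance when its generated patch makes the FAIL_TO_PASS tests pass without breaking the existing PASS_TO_PASS tests.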
In a surprise, Anthropic’s Claude Opus 4.5 narrowly outperformed its successor, Opus 4.6, to secure the number one position. Close behind are Google’s Gemini 3 Flash and the 229-billion-parameter MiniMax M2.5 from China. Several other Chinese models, including GLM-5, Kimi K2.5, and DeepSeek V3.2, also placed in the top ten, underscoring how quickly the gap in specialized coding intelligence and autonomous problem-solving has narrowed.
OpenAI’s showing was weaker than expected, with GPT-5.2 landing in sixth place. Analysts suggest the absence of GPT-5.3-Codex, OpenAI’s dedicated programming model, is likely due to the model not yet being available through the standard API. To keep the comparison fair, the benchmark used a uniform system prompt across all models, isolating raw reasoning ability from the effects of custom prompt engineering.
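The uniform-prompt setup is easiest to picture as a harness in which only the model changes between runs. The sketch below is purely illustrative: run_agent, the model names, and the prompt wording are assumptions for the example, not the benchmark’s actual harness or prompt.

```python
# Illustrative toy harness: one shared system prompt for every model under
# test, so score differences reflect the models rather than prompt tuning.
SYSTEM_PROMPT = (
    "You are a software engineering agent. Work only through bash commands: "
    "explore the repository, edit files, run the tests, then submit a patch."
)

MODELS = ["claude-opus-4.5", "gemini-3-flash", "gpt-5.2"]  # names as reported

def run_agent(model_id: str, system_prompt: str, task: str) -> str:
    """Stand-in for a provider API call; returns the model's proposed patch."""
    raise NotImplementedError(f"wire up the API client for {model_id}")

def score_task(task: str) -> dict[str, str]:
    # Every model receives the identical system prompt and issue text; only
    # the model behind run_agent differs between runs.
    return {m: run_agent(m, SYSTEM_PROMPT, task) for m in MODELS}
```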
These independent tests provide crucial validation as developers increasingly rely on AI to manage complex codebases. By measuring models against actual issues from projects like Django and Scikit-learn, the benchmark offers a realistic view of how these tools perform in production environments. This blend of rigorous evaluation and practical application marks a significant milestone in the evolution of autonomous development assistants.