Anthropic Research Finds Smarter AI Fails Incoherently
- Anthropic researchers decompose AI errors into systematic bias and incoherent variance components.
- Longer reasoning chains lead to increasingly unpredictable and inconsistent model failures.
- Scaling model size fails to consistently reduce error incoherence on complex reasoning tasks.
Safety researchers at Anthropic have introduced a new framework to categorize how artificial intelligence fails: the "hot mess" versus systematic misalignment. By applying the classic bias-variance decomposition, a mathematical way to split errors into consistent mistakes and random noise, the team analyzed frontier models. They discovered that as tasks become more complex and reasoning chains grow longer, AI behavior doesn't just fail; it becomes increasingly incoherent.
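For squared error, that decomposition takes its textbook form (the notation here is generic, not necessarily the paper's):

$$
\mathbb{E}\big[(\hat{y} - y)^2\big] \;=\; \underbrace{\big(\mathbb{E}[\hat{y}] - y\big)^2}_{\text{bias}^2 \text{ (systematic)}} \;+\; \underbrace{\mathrm{Var}[\hat{y}]}_{\text{variance (incoherent)}}
$$

where $\hat{y}$ is a model's answer to a fixed task across independent runs and $y$ is the correct answer. A consistently wrong model concentrates its error in the bias term; a "hot mess" concentrates it in the variance term.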
This shift suggests that superintelligent systems might not always be the calculated "paperclip maximizers" often feared in alignment theory. Instead, they may behave more like a distracted human, producing nonsensical or self-undermining actions that follow no clear goal. Interestingly, while larger models are generally more accurate, scaling does not necessarily fix this drift toward incoherence. On the hardest benchmarks, smarter models actually showed a higher proportion of random, inconsistent errors than they did on simpler tasks.
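To make that "proportion of random errors" concrete, here is a minimal sketch of how such an incoherence measure could be computed from repeated runs of a model on a single task. The setup is illustrative, not the paper's: `target`, `predictions`, and the noise levels are placeholder values.

```python
import numpy as np

# Toy setup (assumed, not from the paper): 100 independent runs of a model
# on one task with a numeric answer. The runs share a systematic offset
# (bias) plus run-to-run scatter (variance).
rng = np.random.default_rng(0)
target = 1.0                                                # ground-truth answer
predictions = target + 0.3 + rng.normal(0, 0.5, size=100)   # offset + noise

mean_pred = predictions.mean()
bias_sq = (mean_pred - target) ** 2           # systematic error: consistent offset
variance = predictions.var()                  # incoherent error: run-to-run scatter
total = np.mean((predictions - target) ** 2)  # expected squared error

# For squared error these satisfy exactly: total = bias^2 + variance
print(f"bias^2={bias_sq:.3f}  variance={variance:.3f}  total={total:.3f}")
print(f"incoherence fraction = {variance / total:.2f}")  # share of error that is noise
```

Comparing this fraction across benchmarks of increasing difficulty is one way to operationalize the article's claim that harder tasks shift errors from bias toward variance.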
These findings have significant implications for AI safety. If frontier models are prone to "industrial accidents" caused by erratic behavior rather than cold, calculated pursuit of wrong objectives, the research community may need to shift focus. Jascha Sohl-Dickstein and the team highlight that preventing "reward hacking"—where a model finds a loophole in its training—remains more critical than trying to constrain a perfectly coherent but misaligned optimizer.
To test this, the researchers trained "mesa-optimizers," small models designed to act as optimizers themselves. They found that even in these controlled environments, the gap between knowing the objective and consistently executing it widened as the models became more capable. This suggests that the difficulty of remaining a coherent optimizer grows alongside the capability of the system itself.
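The training setup itself isn't reproduced here, but the gap in question is easy to state in code. Below is a hypothetical measurement sketch: `model` is a stand-in stub, and the probe prompt and action protocol are invented for illustration, not the authors' actual mesa-optimizer interface.

```python
import random

def model(prompt: str) -> str:
    """Stub mesa-optimizer: it can always state its objective, but its
    actions only follow that objective some of the time."""
    if prompt == "state_objective":
        return "maximize_score"                             # it 'knows' the goal
    return random.choice(["maximize_score", "noise"])       # execution is flaky

def knowing_vs_executing_gap(n_trials: int = 1000) -> float:
    # Knowledge probe: can the model report its own objective?
    knows = model("state_objective") == "maximize_score"
    # Behavioral test: how often do its actions actually pursue that objective?
    executes = sum(model("act") == "maximize_score" for _ in range(n_trials)) / n_trials
    return float(knows) - executes  # 0.0 = perfectly coherent; larger = hotter mess

print(f"coherence gap ~ {knowing_vs_executing_gap():.2f}")  # ~0.5 for this stub
```

In the paper's framing, the worrying result is that this gap widens, rather than closes, as capability increases.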