BullshitBench Reveals Most AI Models Fail to Detect Nonsense
- BullshitBench evaluates whether AI models push back on nonsensical premises instead of providing confident but incorrect answers.
- Anthropic’s Claude 4.6 leads with a 91% pushback rate, significantly outperforming OpenAI’s GPT-5.4 and Google’s Gemini 3 models.
- Extended reasoning modes often decrease detection accuracy, as models prioritize solving problems over questioning their validity.
AI models are notoriously confident, but a new benchmark called BullshitBench reveals a systemic flaw: they often accept nonsensical premises without question. While traditional hallucinations involve making up facts, this "nonsense acceptance" occurs when a model provides a detailed, authoritative response to a fundamentally broken or unanswerable question.
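To make the setup concrete, here is a minimal sketch of what a BullshitBench-style evaluation loop could look like. Everything in it is an assumption for illustration: the benchmark's actual prompts, harness, and scoring are not described here, `query_model` stands in for a real API call, and a production harness would likely use an LLM judge rather than keyword matching.

```python
# Hypothetical sketch of a nonsense-detection eval; not the benchmark's real code.

NONSENSE_PROMPTS = [
    "What year did Einstein win the Super Bowl?",             # false premise
    "How many sides does a circle with four corners have?",   # contradiction
]

# Phrases suggesting the model is questioning the premise rather than
# answering it. A real harness would be far more robust than this.
PUSHBACK_MARKERS = ["premise", "doesn't make sense", "no such", "never", "cannot answer"]

def query_model(prompt: str) -> str:
    """Placeholder for a real provider API call."""
    return "Einstein never won the Super Bowl; the premise is mistaken."

def is_pushback(response: str) -> bool:
    text = response.lower()
    return any(marker in text for marker in PUSHBACK_MARKERS)

def pushback_rate(prompts: list[str]) -> float:
    flagged = sum(is_pushback(query_model(p)) for p in prompts)
    return flagged / len(prompts)

if __name__ == "__main__":
    print(f"Pushback rate: {pushback_rate(NONSENSE_PROMPTS):.0%}")
```

The key design point is that the metric rewards refusal-with-explanation, not just refusal: a model scores well only when it identifies *why* the question is broken.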
The study tested over 80 models and uncovered a wide performance gap between providers. Anthropic’s Claude 4.6 emerged as the clear leader, successfully rejecting 91% of nonsensical queries. In contrast, heavyweights like OpenAI’s GPT-5.4 and Google’s Gemini 3 Pro struggled, both flagging less than half of the flawed premises. The open-source results were even more surprising: Alibaba’s Qwen 3.5 achieved a 78% detection rate, suggesting that size isn't the only factor in critical thinking.
Perhaps the most counterintuitive finding involves "reasoning" modes, where models take extra time to think before responding. For many model families, enabling this feature actually reduced their ability to spot nonsense. Instead of scrutinizing the question, the models used their increased computational budget to construct elaborate justifications for the invalid premises. This suggests that "thinking" in current AI is often optimized for compliance rather than skepticism, highlighting a major hurdle for AI reliability.
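The reasoning-mode finding amounts to an A/B comparison: score the same nonsense prompts with extended reasoning switched on and off. The sketch below is purely illustrative; the `reasoning` flag, the stub responses, and the one-keyword check are all assumptions, since providers expose such modes under different names and APIs.

```python
# Hypothetical A/B harness for the reasoning-mode finding; flag name is assumed.

def query_model(prompt: str, reasoning: bool = False) -> str:
    """Placeholder for a real API call; stub mimics the reported failure mode."""
    if reasoning:
        # Extra "thinking" spent justifying the premise instead of questioning it.
        return "Working through this step by step, the answer is 42."
    return "That question rests on a false premise, so it has no answer."

def is_pushback(response: str) -> bool:
    return "premise" in response.lower()

prompts = ["How many bones are in a snake's legs?"]

for mode in (False, True):
    rate = sum(is_pushback(query_model(p, reasoning=mode)) for p in prompts) / len(prompts)
    print(f"reasoning={mode}: pushback rate {rate:.0%}")
```

If the paper's result holds, the second run would show the lower rate, which is exactly the inversion one would not expect from a feature marketed as deeper deliberation.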