What are the key points?

OpenAI debuts IH-Challenge dataset to train models on prioritizing trusted system instructions over user requests. New GPT-5 Mini-R model shows significant gains in safety steerability and prompt injection resistance. Instruction hierarchy establishes a clear priority chain: System, followed by Developer, User, and external Tools.

OpenAI Boosts GPT-5 Safety via Instruction Hierarchy

•OpenAI debuts IH-Challenge dataset to train models on prioritizing trusted system instructions over user requests.
•New GPT-5 Mini-R model shows significant gains in safety steerability and prompt injection resistance.
•Instruction hierarchy establishes a clear priority chain: System, followed by Developer, User, and external Tools.

OpenAI has introduced a new training framework designed to solve a fundamental "who do I listen to?" problem in artificial intelligence. As models interact with more sources—ranging from their core safety programming to untrusted data from the web—they often struggle to prioritize conflicting commands. This confusion is the root cause of many safety failures, including prompt injections where a malicious actor hides a command inside a website for the AI to find and follow.

To address this, researchers developed the IH-Challenge, a reinforcement learning dataset that enforces a strict instruction hierarchy: System > Developer > User > Tool. By training models on simple, objectively gradable tasks, OpenAI produced GPT-5 Mini-R. This internal model demonstrates vastly improved "safety steerability," meaning it adheres more strictly to its core safety policies even when a user or an external tool tries to trick it into breaking them.

The beauty of this approach lies in its generalization. Rather than playing a game of "whack-a-mole" with specific hacking techniques, the model learns a foundational rule of thumb: prioritize trusted system messages above all else. This structural change ensures that as AI becomes more autonomous and capable of browsing the web or using apps, it remains anchored to its original safety constraints without sacrificing general performance or becoming overly prone to refusing harmless requests.

OpenAI has introduced a new training framework designed to solve a fundamental "who do I listen to?" problem in artificial intelligence. As models interact with more sources—ranging from their core safety programming to untrusted data from the web—they often struggle to prioritize conflicting commands. This confusion is the root cause of many safety failures, including prompt injections where a malicious actor hides a command inside a website for the AI to find and follow.

To address this, researchers developed the IH-Challenge, a reinforcement learning dataset that enforces a strict instruction hierarchy: System > Developer > User > Tool. By training models on simple, objectively gradable tasks, OpenAI produced GPT-5 Mini-R. This internal model demonstrates vastly improved "safety steerability," meaning it adheres more strictly to its core safety policies even when a user or an external tool tries to trick it into breaking them.

The beauty of this approach lies in its generalization. Rather than playing a game of "whack-a-mole" with specific hacking techniques, the model learns a foundational rule of thumb: prioritize trusted system messages above all else. This structural change ensures that as AI becomes more autonomous and capable of browsing the web or using apps, it remains anchored to its original safety constraints without sacrificing general performance or becoming overly prone to refusing harmless requests.

OpenAI Boosts GPT-5 Safety via Instruction Hierarchy

Tags