THINKSAFE: Self-Generated Safety Alignment for Reasoning Models
- ThinkSafe framework boosts AI safety without relying on external teacher model distillation.
- Lightweight refusal steering extracts latent safety knowledge to generate internal reasoning traces.
- Method maintains high reasoning performance on DeepSeek and Qwen models at lower costs.
Large reasoning models often fall into a trap: the intense optimization required for complex logical tasks can make them overly compliant, causing them to ignore safety protocols in favor of following user instructions. This "safety degradation" is a major hurdle for specialized models. Traditionally, developers have fixed this by training the model to mimic a safer "teacher" model. However, this often creates a clash between the teacher’s style and the model’s own way of thinking, which can weaken its core reasoning abilities and logical consistency.
The ThinkSafe framework offers a clever alternative by looking inward. Instead of importing safety from an external source, it uses "lightweight refusal steering" to tap into the model's existing, hidden knowledge about what constitutes harm. The steering nudges the model to explain, in its own chain-of-thought reasoning style, why it should refuse a dangerous prompt. By generating these "in-distribution" safety explanations and then fine-tuning on them, the model learns to be safe while staying true to its original reasoning patterns.
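The summary doesn't spell out the steering mechanics, but a common recipe for this kind of activation steering is to compute a "refusal direction" as the difference of mean hidden states between harmful and harmless prompts, then add a scaled copy of it during generation. The sketch below illustrates that recipe on synthetic activations (the data, dimensions, and `alpha` scale are all illustrative assumptions, not ThinkSafe's actual procedure):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy hidden states (batch, d_model): stand-ins for activations collected
# from a reasoning model on harmful vs. harmless prompts (synthetic data).
d_model = 16
harmful_acts = rng.normal(0.0, 1.0, size=(32, d_model)) + 2.0  # shifted cluster
harmless_acts = rng.normal(0.0, 1.0, size=(32, d_model))

# Difference-of-means steering vector, normalized to unit length.
refusal_dir = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
refusal_dir /= np.linalg.norm(refusal_dir)

def steer(hidden, direction, alpha=4.0):
    """Add a scaled steering vector to a hidden state, nudging generation
    toward refusal-style reasoning (alpha controls steering strength)."""
    return hidden + alpha * direction

# Steering a fresh activation raises its projection onto the refusal
# direction by exactly alpha (since the direction has unit norm).
h = rng.normal(0.0, 1.0, size=(1, d_model))
h_steered = steer(h, refusal_dir)
before = float(h @ refusal_dir)
after = float(h_steered @ refusal_dir)
print(round(after - before, 6))  # → 4.0
```

In a real model this addition would happen inside a forward hook on one transformer layer; the steered generations become the self-generated safety traces that the model is then fine-tuned on.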
In testing, ThinkSafe outperformed traditional methods like GRPO, achieving significantly higher safety scores without sacrificing the ability to solve complex math or logic problems. Most impressively, it does so with a much smaller computational footprint, making safety alignment more accessible to researchers. This self-evolution approach suggests that the next generation of AI might find its moral compass within its own learned knowledge rather than through external censorship.