Activation Oracles: Training LLMs to Explain AI Neural Signals
- Anthropic develops Activation Oracles to interpret internal AI neural signals using natural language
- Oracles successfully uncover secret knowledge and hidden misalignment in fine-tuned models
- Performance scales predictably with increased training data quantity and task diversity
Anthropic researchers, including Adam Karvonen and James Chua, have developed a new method called Activation Oracles (AOs) to address the 'black box' problem of AI. Instead of using complex mathematical analysis to infer what an AI is thinking, they train a second LLM to read the internal neural activations—the numerical signals within a target model's processing layers. By treating these signals as a new input modality, similar to text, the oracle can answer questions like 'What secret word is this model hiding?' even if it wasn't trained on that exact data.

The team found that these oracles generalize surprisingly well: they can analyze models that have undergone fine-tuning (specialized training) for tasks the oracle has never seen before. For instance, in a game of Taboo, the oracle successfully identified secret words that the target model had been specifically instructed never to reveal. This suggests that AOs can uncover hidden knowledge or misalignment that might otherwise go unnoticed by human developers.

Unlike traditional mechanistic interpretability, which examines the specific 'gears' of a model, AOs offer a more flexible, natural-language approach. While they are more computationally expensive than simpler tools, their performance improves as they are trained on larger and more diverse datasets. This makes them a powerful new tool for AI safety, helping researchers understand why a model behaves the way it does before it is deployed to the public.
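The core idea—capturing a target model's hidden activations and feeding them to an oracle as if they were another input modality—can be illustrated with a minimal NumPy sketch. Everything here is hypothetical scaffolding: the toy two-layer target model, the dimensions, and the learned projection `W_proj` that maps activations into the oracle's embedding space are illustrative stand-ins, not Anthropic's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "target model": a 2-layer MLP whose hidden activations we capture.
# Dimensions are illustrative, not taken from the paper.
D_IN, D_HID, D_EMB = 8, 16, 32
W1 = rng.normal(size=(D_IN, D_HID))
W2 = rng.normal(size=(D_HID, D_IN))

def target_forward(x):
    """Run the target model and return (output, hidden activations)."""
    h = np.maximum(x @ W1, 0.0)  # hidden layer: the signal the oracle reads
    return h @ W2, h

# Oracle side: a (notionally learned) linear projection maps the target's
# activations into the oracle LLM's embedding space, so they can be
# consumed alongside ordinary token embeddings.
W_proj = rng.normal(size=(D_HID, D_EMB))

def activations_to_oracle_embeddings(h):
    """Project captured activations into 'soft token' embeddings."""
    return h @ W_proj

x = rng.normal(size=(4, D_IN))           # a batch of 4 inputs
_, hidden = target_forward(x)
soft_tokens = activations_to_oracle_embeddings(hidden)
print(soft_tokens.shape)                 # one pseudo-token embedding per input
```

In a real system the projection would be trained jointly with the oracle on question-answer pairs about the target model's internals; this sketch only shows the interface: activations go in, embedding-space vectors come out, and the oracle then answers questions about them in natural language.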