Quality Over Quantity: New Method Boosts LLM Synthetic Data Efficiency
- Feature Activation Coverage (FAC) measures data diversity using internal model activations rather than surface text variation.
- FAC Synthesis matches the performance of 300K-sample training with just 2K targeted synthetic samples on the AlpacaEval 2.0 benchmark.
- Researchers discovered a shared, interpretable feature space across the LLaMA, Mistral, and Qwen model families.
Training modern Large Language Models requires massive amounts of high-quality data, often leading researchers to rely on synthetic data to fill the gaps. However, current methods usually focus on textual diversity—simply making sure the sentences look different on the surface. A new research paper titled "Less is Enough" argues that this approach misses the mark, introducing a metric called Feature Activation Coverage (FAC) to look deeper into the model's internal logic.
FAC measures how well a dataset covers the various conceptual features a model has learned, such as specific reasoning patterns or factual associations. By using a sparse autoencoder—a tool that translates complex neural patterns into human-understandable concepts—the researchers identify which internal features are underrepresented in a small seed dataset. They then generate new synthetic samples specifically designed to activate those missing features, ensuring every piece of data serves a precise functional purpose.
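The core idea above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the SAE encoder weights, the activation threshold, and the feature/hidden dimensions are all hypothetical stand-ins. The sketch encodes each sample's hidden activations into sparse features, records which features ever fire, and reports the covered fraction plus the missing features that new synthetic samples would be generated to target.

```python
import numpy as np

# Toy stand-in for a trained sparse autoencoder (SAE) encoder.
# In practice the SAE is trained on a real LLM layer's activations;
# here random weights just make the sketch runnable.
rng = np.random.default_rng(0)
D_MODEL = 64       # hidden size of the probed LLM layer (toy value)
N_FEATURES = 256   # SAE dictionary size (toy value)
W_enc = rng.normal(scale=0.1, size=(N_FEATURES, D_MODEL))
b_enc = rng.normal(scale=0.01, size=N_FEATURES)

def sae_features(hidden):
    """Encode one activation vector into sparse feature activations (ReLU)."""
    return np.maximum(W_enc @ hidden + b_enc, 0.0)

def feature_activation_coverage(hiddens, threshold=0.05):
    """Return (coverage, fired): the fraction of SAE features activated
    above `threshold` by at least one sample, and the per-feature mask."""
    fired = np.zeros(N_FEATURES, dtype=bool)
    for h in hiddens:
        fired |= sae_features(h) > threshold
    return fired.mean(), fired

# "Seed dataset": activation vectors collected from a small real dataset.
seed_acts = rng.normal(size=(100, D_MODEL))
coverage, fired = feature_activation_coverage(seed_acts)
missing = np.flatnonzero(~fired)  # features to target with new synthetic samples
print(f"coverage={coverage:.2f}, {len(missing)} features still uncovered")
```

In the paper's pipeline, the `missing` set would drive generation: each uncovered feature's human-readable interpretation guides prompts for new synthetic samples intended to activate it, and coverage is re-measured after each round.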
The results are striking: with just 2,000 targeted synthetic samples, the researchers matched the performance of the popular MAGPIE dataset, which uses 300,000 samples. This 150x improvement in data efficiency held true across various tasks, including instruction following and toxicity detection. Perhaps most surprisingly, the team found that models from different families, like LLaMA and Mistral, actually share many of the same internal feature spaces. This suggests that data optimized for one model could potentially benefit many others, paving the way for more efficient generalization across the AI ecosystem.