Why it’s critical to move beyond overly aggregated machine-learning metrics
- MIT researchers reveal that high average performance metrics often mask model failures for specific patient sub-populations.
- Spurious correlations cause models to perform poorly on 6-75% of new data despite high training scores.
- The new OODSelect algorithm identifies specific data subsets where models fail to maintain their accuracy rankings.
MIT researchers are sounding the alarm on a hidden danger in artificial intelligence: reliance on overly aggregated evaluation metrics. While a diagnostic model might appear highly accurate when evaluated on a massive, combined dataset, new research from the Laboratory for Information and Decision Systems (LIDS) shows this can be a dangerous illusion. The study found that a model hailed as the "best" in one hospital could actually be the worst performer for up to 75% of patients in a different clinical setting. This happens because the systems rely on spurious correlations: they take shortcuts by connecting irrelevant features, such as hospital-specific image markings, to a diagnosis rather than identifying actual anatomical signs of disease.

To combat this, the team developed OODSelect, an algorithm designed to pinpoint the specific sub-populations where these models fail. By dropping the assumption that top-ranked models remain effective everywhere, researchers can better measure how well a system generalizes to the real world. This method matters for AI safety, ensuring that critical healthcare decisions rest on robust medical evidence rather than coincidental data patterns that vanish in new environments.
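The core phenomenon can be illustrated in a few lines. The following is a minimal toy sketch, not the OODSelect algorithm or any data from the MIT study: all patient counts, labels, and model behaviors are invented to show how a model that wins on aggregate accuracy can still be the worst choice for a minority sub-population.

```python
# Toy illustration of aggregate metrics masking sub-population failures.
# All data and models here are hypothetical, not from the MIT/LIDS study.

def accuracy(preds, labels):
    """Fraction of predictions matching the labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# 18 patients at hospital A, only 2 at hospital B (toy labels).
labels_a = [i % 2 for i in range(18)]
labels_b = [1, 0]

# Model X: perfect at hospital A (imagine it learned a hospital-A-specific
# image marking as a shortcut), but wrong on every hospital-B patient.
preds_x_a = labels_a[:]
preds_x_b = [1 - y for y in labels_b]

# Model Y: misses 3 patients at hospital A, but generalizes to hospital B.
preds_y_a = [1 - y for y in labels_a[:3]] + labels_a[3:]
preds_y_b = labels_b[:]

# Aggregate accuracy over the pooled dataset favors model X...
agg_x = accuracy(preds_x_a + preds_x_b, labels_a + labels_b)  # 0.90
agg_y = accuracy(preds_y_a + preds_y_b, labels_a + labels_b)  # 0.85

# ...but on the hospital-B sub-population the ranking inverts completely.
sub_x = accuracy(preds_x_b, labels_b)  # 0.0
sub_y = accuracy(preds_y_b, labels_b)  # 1.0

print(f"aggregate: X={agg_x:.2f} Y={agg_y:.2f}")
print(f"hospital B: X={sub_x:.2f} Y={sub_y:.2f}")
```

Because hospital B contributes only 2 of 20 patients, its total failure barely dents model X's pooled score. Finding such ranking-inverting subsets automatically, rather than by hand as here, is the kind of problem OODSelect is designed to address.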