RSNA Large Language Model Benchmark Dataset for Chest Radiographs of Cardiothoracic Disease: Radiologist Evaluation and Validation Enhanced by AI Labels (REVEAL-CXR)
- RSNA releases REVEAL-CXR benchmark with 200 expert-verified chest radiographs for multimodal LLM evaluation
- Researchers utilize GPT-4o and Phi-4-Reasoning to automate initial labeling before rigorous radiologist validation
- Dataset includes 100 public and 100 holdout cases to ensure unbiased assessment of medical AI
The Radiological Society of North America (RSNA) has introduced REVEAL-CXR, a benchmark designed to bridge the gap between general AI capabilities and specialized medical diagnostics. While multimodal large language models (MLLMs) have shown promise in passing board exams, their clinical utility remains difficult to measure without expert-curated data. The new dataset fills that gap by providing 200 chest radiographic studies, each verified by a panel of seventeen subspecialty radiologists to ensure ground-truth accuracy.
To streamline the labor-intensive process of medical labeling, the team implemented a hybrid AI-assisted workflow. They used OpenAI’s GPT-4o to extract abnormal findings from existing reports; those findings were then mapped to specific diagnostic categories by Phi-4-Reasoning, a locally hosted model optimized for logical deduction. This semi-automated pipeline let experts validate suggested labels rather than start from scratch, reducing manual labeling effort while maintaining the "gold standard" of human oversight.
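The three stages of that workflow (extract findings, map them to categories, have an expert confirm or reject) can be sketched in a few lines of Python. This is an illustration only, not the RSNA team's code: the two model calls are replaced by simple rule-based stand-ins, and every name here (`Study`, `extract_findings`, `map_to_categories`, `run_pipeline`, the `CATEGORY_KEYWORDS` table) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical keyword-to-category table; the benchmark's real label set is not given here.
CATEGORY_KEYWORDS = {
    "pneumothorax": "Pneumothorax",
    "effusion": "Pleural effusion",
    "cardiomegaly": "Cardiomegaly",
}


@dataclass
class Study:
    study_id: str
    report: str
    suggested: List[str] = field(default_factory=list)  # AI-proposed labels
    verified: List[str] = field(default_factory=list)   # labels kept by the reviewer


def extract_findings(report: str) -> List[str]:
    """Stand-in for the LLM extraction step: keep sentences that are not negated."""
    sentences = [s.strip() for s in report.split(".") if s.strip()]
    return [s for s in sentences if not s.lower().startswith("no ")]


def map_to_categories(findings: List[str]) -> List[str]:
    """Stand-in for the category-mapping step: simple keyword lookup."""
    labels: List[str] = []
    for finding in findings:
        for keyword, label in CATEGORY_KEYWORDS.items():
            if keyword in finding.lower() and label not in labels:
                labels.append(label)
    return labels


def run_pipeline(study: Study, review: Callable[[Study], List[str]]) -> Study:
    """Propose labels automatically, then let a human reviewer decide what stays."""
    study.suggested = map_to_categories(extract_findings(study.report))
    study.verified = review(study)
    return study


# Usage: the reviewer here is a placeholder that rejects one suggested label.
example = Study("cxr-001", "Small left pleural effusion. No pneumothorax. Mild cardiomegaly.")
run_pipeline(example, lambda s: [label for label in s.suggested if label != "Cardiomegaly"])
print(example.suggested, example.verified)
```

The point of the design is that the reviewer callback has the final say: the automated stages only fill `suggested`, and nothing reaches `verified` without passing through the human step.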
The resulting benchmark is split into two halves: a 100-case public set for development and a 100-case holdout set reserved for independent RSNA evaluation. By prioritizing rarer findings and complex clinical scenarios, REVEAL-CXR challenges models to go beyond surface-level patterns. This initiative highlights a growing trend in the industry: using AI to build the very guardrails and evaluation metrics needed to verify safety in high-stakes environments like healthcare.
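As a rough illustration of how a rarity-weighted selection and an even public/holdout split might be set up, consider the sketch below. The article does not describe the actual sampling procedure, so the ranking rule (prefer cases whose rarest label occurs least often), the function name `select_and_split`, and the random assignment to halves are all assumptions made for the example.

```python
import random
from collections import Counter
from typing import Dict, List, Tuple


def select_and_split(
    cases: Dict[str, List[str]], n_total: int, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Pick n_total cases, favoring rare labels, and split them into two equal halves.

    cases maps a case ID to its list of labels. Both the rarity rule and the
    random split are illustrative assumptions, not the benchmark's documented method.
    """
    # Count how often each label appears across the whole candidate pool.
    freq = Counter(label for labels in cases.values() for label in labels)

    def rarity(case_id: str) -> float:
        # A case is as rare as its rarest label; unlabeled cases rank last.
        return min((freq[label] for label in cases[case_id]), default=float("inf"))

    # Lowest count first; ties broken by case ID so the ranking is deterministic.
    ranked = sorted(cases, key=lambda case_id: (rarity(case_id), case_id))
    chosen = ranked[:n_total]

    # Shuffle with a fixed seed, then cut the selection in half.
    rng = random.Random(seed)
    rng.shuffle(chosen)
    half = n_total // 2
    return sorted(chosen[:half]), sorted(chosen[half:])


# Usage with a toy pool of six cases; the real benchmark would use n_total=200.
pool = {
    "a": ["common"], "b": ["common"], "c": ["common", "rare1"],
    "d": ["rare2"], "e": ["common"], "f": ["common", "rare3"],
}
public, holdout = select_and_split(pool, n_total=4)
print(public, holdout)
```

Keeping the holdout half out of public view is what allows it to serve as an independent check: a model tuned on the public cases has never seen the held-back ones.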