Multimodal AI Struggles to Read Text as Pixels
- Researchers identify a significant modality gap when AI processes text as images versus raw tokens.
- Variations in font and resolution can impact visual text accuracy by up to 47 percentage points.
- New self-distillation technique boosts image-based math reasoning from 30.71% to over 92% accuracy.
Multimodal large language models (MLLMs) are celebrated for their ability to interpret visual data, yet a fundamental discrepancy exists in how they process information. When text is presented as raw pixels—such as in a screenshot or a scanned document—rather than as digital tokens, model performance often takes a dramatic hit. This phenomenon, termed the "modality gap," reveals that even the most advanced systems struggle to bridge the divide between visual perception and logical analysis.
A systematic evaluation of seven leading MLLMs across various benchmarks highlighted that the gap is highly sensitive to formatting. Simple changes in font or image resolution can swing accuracy by as much as 47 percentage points. Interestingly, while models retain their underlying knowledge, they suffer from a "reasoning collapse" when forced to interpret visual inputs. This suggests that the challenge is not a lack of intelligence, but a failure in the initial "reading" phase that disrupts subsequent logical steps.
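The gap itself is simple to quantify: score the same benchmark items twice, once with the question given as tokens and once as a rendered image, and take the difference in percentage points. A minimal sketch, where the prediction dictionaries are illustrative stand-ins for real model outputs:

```python
# Minimal sketch of measuring the "modality gap": score the same benchmark
# items in text mode and in image mode, then report the accuracy drop in
# percentage points. The prediction dicts below are toy stand-ins, not
# outputs from an actual MLLM.

def accuracy(preds, gold):
    """Fraction of items answered correctly."""
    return sum(preds[q] == a for q, a in gold.items()) / len(gold)

def modality_gap(text_preds, image_preds, gold):
    """Percentage-point drop when the same questions arrive as pixels."""
    return 100 * (accuracy(text_preds, gold) - accuracy(image_preds, gold))

gold = {"q1": "4", "q2": "9", "q3": "16"}
text_preds = {"q1": "4", "q2": "9", "q3": "16"}   # all correct as tokens
image_preds = {"q1": "4", "q2": "7", "q3": "10"}  # misread as pixels

print(round(modality_gap(text_preds, image_preds, gold), 1))  # prints 66.7
```

Reporting the gap per formatting condition (font, resolution) is what surfaces swings like the 47-point figure above.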
To combat this, researchers introduced a self-distillation method in which a model learns from its own best performances. By training models on their own text-based reasoning traces paired with image renderings of the same inputs, they taught the AI to maintain its logical flow regardless of input format. On the GSM8K math benchmark, this approach lifted image-mode accuracy from 30.71% to over 92%. The result suggests that the next generation of AI will integrate visual and textual data with much higher parity, enabling more reliable analysis of complex documents.
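The data-construction step described above can be sketched as follows. This is a hedged illustration of the general self-distillation recipe, not the authors' actual code: names like `render_to_image` and the last-token answer check are assumptions for the sake of a runnable example.

```python
# Hedged sketch of a self-distillation data pipeline: keep only questions
# the model answers correctly in text mode, then pair each question's
# *image* rendering with the model's own text-mode reasoning trace as the
# fine-tuning target. All names here are illustrative assumptions.

def build_self_distillation_set(questions, text_traces, gold, render_to_image):
    """Return (image_input, target_trace) pairs from correct text-mode runs."""
    pairs = []
    for q in questions:
        trace = text_traces[q]               # model's own reasoning trace
        answer = trace.strip().split()[-1]   # assume final token is the answer
        if answer == gold[q]:                # distill only from correct runs
            pairs.append((render_to_image(q), trace))
    return pairs

# Toy usage with a stand-in renderer that just tags the text.
render = lambda q: f"<image of: {q}>"
questions = ["2+2", "3*3"]
traces = {"2+2": "2 plus 2 equals 4", "3*3": "3 times 3 equals 8"}
gold = {"2+2": "4", "3*3": "9"}
print(build_self_distillation_set(questions, traces, gold, render))
```

Fine-tuning on such pairs pushes the model to reproduce its own token-mode reasoning even when the question arrives as pixels, which is the mechanism credited with closing the gap on GSM8K.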