MinerU-Diffusion Speeds OCR via Parallel Diffusion Decoding
- MinerU-Diffusion replaces sequential text generation with parallel diffusion denoising for 3.2x faster document OCR.
- New block-wise decoder architecture enables stable processing of long sequences and complex document layouts.
- Model achieves superior robustness on Semantic Shuffle benchmarks by reducing dependence on linguistic priors.
Traditional document-reading systems typically convert images of text into digital data by predicting one character or word at a time. While effective, this sequential approach has two costs: a single early mistake can propagate through the rest of the text, and decoding time grows with document length, which slows the processing of long files.
MinerU-Diffusion introduces a paradigm shift by treating document conversion as an inverse rendering task. Instead of reading left-to-right, the model uses a diffusion-based framework to generate the entire document content simultaneously through parallel denoising. This method allows the system to refine the text and layout across the whole page at once, much like an artist refining a sketch into a finished painting.
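To make the idea of parallel denoising concrete, here is a minimal toy sketch in Python. It assumes a masked-token style of discrete diffusion, in which every position starts masked and the most confident predictions are committed at each step. The source does not say which diffusion variant MinerU-Diffusion uses, so treat this as an illustration of the general idea only. The `toy_predict` function is a hypothetical stand-in for a real model: it simply reads the answer from `target` and attaches a random confidence.

```python
import random

MASK = None  # marks a position that has not been decoded yet


def toy_predict(tokens, target, rng):
    """Hypothetical stand-in for the model (an assumption, not a real API).

    Proposes a token and a confidence score for every masked position
    in a single call, which is what makes the decoding parallel.
    """
    return {i: (target[i], rng.random())
            for i, tok in enumerate(tokens) if tok is MASK}


def parallel_denoise(target, steps=4, seed=0):
    """Fill a fully masked sequence in at most `steps` model calls."""
    rng = random.Random(seed)
    tokens = [MASK] * len(target)
    passes = 0
    for step in range(steps):
        proposals = toy_predict(tokens, target, rng)
        if not proposals:
            break  # everything is already decoded
        passes += 1
        remaining_steps = steps - step
        # Commit an even share of the masked positions, most confident first.
        # On the last step the share is everything that is still masked.
        k = -(-len(proposals) // remaining_steps)  # ceiling division
        best = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)[:k]
        for i in best:
            tokens[i] = proposals[i][0]
    return "".join(tokens), passes


text, passes = parallel_denoise("Invoice total: 42.00", steps=4)
print(text, passes)
```

The point of the sketch is the call count: the 20-character string is recovered in 4 model calls, whereas a decoder that emits one character per call would need 20. A real model would also make mistakes along the way, which this toy predictor does not.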
The framework uses a specialized block-wise decoder and a curriculum learning strategy, a training method that starts with easier tasks before progressing to complex ones. These innovations result in a 3.2x speedup over models that generate text sequentially. Furthermore, by relying more on visual cues than on predictable language patterns, MinerU-Diffusion demonstrates exceptional accuracy in parsing dense tables, mathematical formulas, and irregular document structures.
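A rough way to picture block-wise decoding is to count model calls. The Python sketch below assumes that blocks are processed in order and that each block is denoised in a fixed number of parallel steps. The source does not spell out the actual scheme, and the sequence length, block size, and step count are hypothetical values chosen for illustration.

```python
def split_into_blocks(length, block_size):
    """Return (start, end) index pairs that cover range(length) in order."""
    return [(start, min(start + block_size, length))
            for start in range(0, length, block_size)]


def forward_passes(length, block_size, steps_per_block):
    """Model calls needed when every block takes a fixed number of steps."""
    return len(split_into_blocks(length, block_size)) * steps_per_block


# Hypothetical numbers, not taken from the source.
length, block_size, steps_per_block = 1000, 256, 8

blocks = split_into_blocks(length, block_size)
blockwise_calls = forward_passes(length, block_size, steps_per_block)
sequential_calls = length  # one call per token when decoding sequentially

print(blocks)
print(blockwise_calls, sequential_calls)
```

Call counts are not wall-clock time: each denoising call covers a whole block and may cost more than a single sequential step, so this ratio should not be read as the reported 3.2x figure.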