Baidu Launches Qianfan-OCR for Advanced Document Parsing
- Baidu introduces Qianfan-OCR, a 4B-parameter model unifying document parsing and layout analysis.
- A "Layout-as-Thought" mechanism uses "think tokens" to generate structural coordinates before textual output.
- The model claims top spots on OmniDocBench and OlmOCR Bench, outperforming larger proprietary competitors.
Baidu has introduced Qianfan-OCR, a specialized 4B-parameter vision-language model designed to streamline the complex process of document intelligence. Traditionally, extracting information from PDFs or images requires multiple steps: identifying the layout, recognizing text, and then structuring the data. Qianfan-OCR simplifies this by unifying document parsing, layout analysis, and high-level understanding into a single, cohesive architecture. This end-to-end approach allows the model to handle diverse tasks like direct image-to-Markdown conversion and complex table extraction within one workflow.
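To make the "single workflow" idea concrete, the sketch below shows what a unified request shape might look like, where one task label replaces the traditional layout-detection, text-recognition, and structuring stages. The endpoint shape, model identifier, field names, and task labels are hypothetical illustrations, not Baidu's published API.

```python
# Hypothetical sketch of a unified document-parsing request: one call with a
# task label, instead of a multi-stage pipeline. All names are assumptions,
# NOT Baidu's actual AI Cloud interface.

def build_ocr_request(image_b64: str, task: str = "image_to_markdown") -> dict:
    """Build a single request covering parsing, layout, and extraction tasks."""
    supported = {"image_to_markdown", "table_extraction", "layout_analysis"}
    if task not in supported:
        raise ValueError(f"unsupported task: {task}")
    return {
        "model": "qianfan-ocr",           # assumed model identifier
        "task": task,                      # one label selects the whole workflow
        "input": {"image": image_b64},     # base64-encoded document image
    }

# Example: requesting complex table extraction in the same call shape
# used for direct image-to-Markdown conversion.
req = build_ocr_request("aGVsbG8=", task="table_extraction")
```

The design point is that the caller never orchestrates intermediate steps; the model's architecture absorbs them.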
The breakthrough feature of the model is its "Layout-as-Thought" mechanism. This process uses special "think tokens" to trigger an internal reasoning phase in which the model generates structured layout representations, such as bounding boxes and reading order, before producing the final text. By grounding itself in the document's structure first, the model significantly reduces errors on complex layouts that often trip up conventional OCR pipelines. Because this "thinking" step happens inside a single forward pass rather than a separate model, it delivers the accuracy of staged systems without their typical latency.
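A rough way to picture a Layout-as-Thought output is a reasoning span carrying the structural coordinates, followed by the grounded text. The `<think>` tag and bbox serialization below are assumptions for illustration; the model's actual token format may differ.

```python
import re

# Illustrative Layout-as-Thought output: a "think" span with reading order
# and bounding boxes, then the final text. The serialization is an ASSUMED
# format, not Qianfan-OCR's documented one.
SAMPLE = (
    "<think>"
    "region 1: title bbox=(40,20,560,80); "
    "region 2: paragraph bbox=(40,100,560,400)"
    "</think>"
    "# Quarterly Report\nRevenue grew in Q3."
)

def split_layout_and_text(output: str):
    """Separate the structural reasoning span from the final textual output."""
    m = re.match(r"<think>(.*?)</think>(.*)", output, re.DOTALL)
    if not m:
        return [], output  # no think span: treat everything as plain text
    layout_str, text = m.groups()
    # Pull (region_type, bbox) pairs out of the reasoning span.
    regions = re.findall(r"(\w+) bbox=\((\d+),(\d+),(\d+),(\d+)\)", layout_str)
    layout = [(kind, tuple(map(int, box))) for kind, *box in regions]
    return layout, text

layout, text = split_layout_and_text(SAMPLE)
```

Here the bounding boxes fix the reading order before any text is committed, which is the grounding step the article describes.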
In performance evaluations, Qianfan-OCR secured the top spot on major benchmarks like OmniDocBench v1.5 and OlmOCR Bench. Remarkably, this 4B-parameter model surpassed significantly larger competitors, including Gemini-3.1-Pro and Qwen3-VL-235B, in key information extraction tasks. Currently available via Baidu’s AI Cloud, the model represents a significant shift toward more efficient, specialized multimodal architectures that prioritize structural awareness alongside linguistic capability.