Qianfan-OCR — 4B end-to-end document AI model: 93.12 on OmniDocBench v1.5, 192 languages, runs on a single A100 with vLLM

r/LocalLLaMA

We just open-sourced Qianfan-OCR, a 4B-parameter end-to-end vision-language model for document understanding. Instead of the typical detect → recognize → LLM pipeline, this model handles OCR, layout analysis, table extraction, formula recognition, chart understanding, and key information extraction, all in one forward pass.

Core idea: Layout-as-Thought

The model can optionally enter a reasoning phase before generating output, where it reasons about bounding boxes, element types, and reading order. Think of it as Chain-of-Thought, but for document layout.
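To make the Layout-as-Thought idea a bit more concrete, here is a rough Python sketch of what a layout plan could look like as data: each element carries a bounding box, an element type, and a reading-order index, and the final text is emitted in that order. This is illustration only. The `LayoutElement` class, its field names, the `(x0, y0, x1, y1)` pixel convention, and `render_in_reading_order` are all hypothetical and are not the model's actual reasoning format or API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LayoutElement:
    """One planned page element (hypothetical structure, not the model's real format)."""
    bbox: Tuple[int, int, int, int]  # assumed (x0, y0, x1, y1) in pixels
    kind: str                        # element type, e.g. "title", "text", "table"
    order: int                       # position in the reading order
    content: str = ""                # recognized text for this element


def render_in_reading_order(elements: List[LayoutElement]) -> str:
    """Join element contents following the planned reading order."""
    ordered = sorted(elements, key=lambda e: e.order)
    return "\n\n".join(e.content for e in ordered if e.content)


# Made-up plan for a two-column page, listed out of order on purpose.
plan = [
    LayoutElement((320, 100, 600, 400), "text", 2, "Right column body."),
    LayoutElement((20, 20, 600, 80), "title", 0, "Quarterly Report"),
    LayoutElement((20, 100, 300, 400), "text", 1, "Left column body."),
]

print(render_in_reading_order(plan))
```

The point of the sketch is just that deciding boxes, types, and order first makes the final output a simple ordered walk over the plan, which is roughly the role a layout reasoning phase would play before the model writes its answer.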