Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing

ArXi:2603.24326v1 Announce Type: new Document parsing is a fine-grained task where image resolution significantly impacts performance. While advanced research leveraging vision-language models benefits from high-resolution input to boost model performance, this often leads to a quadratic increase in the number of vision tokens and significantly raises computational costs. We attribute this inefficiency to substantial visual regions redundancy in document images, like background.