DocPrune: Efficient Document Question Answering via Background, Question, and Comprehension-aware Token Pruning

arXiv:2604.22281v1 Announce Type: new Recent advances in vision-language models have demonstrated remarkable performance across diverse multi-modal tasks, including document question answering, which leverages structured visual cues from text, tables, and figures. However, unlike natural images, document images contain large backgrounds and only sparse supporting evidence, leading to inefficient consumption of substantial computational resources, especially for long documents.