FlipVQA: Scaling Multi-modal Instruction Tuning via Textbook-to-Knowledge Synthesis

arXiv:2511.16216v2 Announce Type: replace Textbooks are among the richest repositories of human-verified reasoning knowledge, yet their complex layouts, which include multi-column typesetting, cross-page question-answer separation, and interleaved figures, make automated extraction of structured QA and VQA pairs extremely challenging. Existing alternatives either synthesize data from scratch, which lacks authentic problem contexts, or rely on costly expert annotation that cannot scale.