LITTA: Late-Interaction and Test-Time Alignment for Visually-Grounded Multimodal Retrieval

arXiv:2603.26683v1

Retrieving relevant evidence from visually rich documents such as textbooks, technical reports, and manuals is challenging because of long contexts, complex layouts, and weak lexical overlap between user questions and supporting pages. We propose LITTA, a query-expansion-centric framework for evidence page retrieval that improves multimodal document retrieval without retriever re