AI RESEARCH

ForMaT: Dataset for Visually-Grounded Multilingual PDF Translation

arXiv cs.CL

arXiv:2605.15794v1 Announce Type: new We present ForMaT (Format-Preserving Multilingual Translation), a parallel corpus of 3,956 PDFs across 15 language pairs, proposed for multimodal machine translation, that preserves the original layout metadata. To ensure structural diversity, we apply K-Medoids sampling over 45 geometric features that capture complex elements such as nested tables and formulas, so that the dataset focuses only on visually diverse PDF documents.
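
The K-Medoids sampling step can be illustrated with a minimal sketch. The abstract does not name the distance metric, the number of clusters, or the K-Medoids variant, so the Euclidean distance, the simple alternating (Voronoi-iteration) update, and all names below (`k_medoids`, `features`) are assumptions made for illustration, not the authors' implementation. Each document is represented by a feature vector (45 geometric features in the paper, 2 in this toy example), and the selected medoids are actual documents that stand in for their clusters.

```python
import math
import random


def k_medoids(points, k, n_iter=100, seed=0):
    """Alternating k-medoids (assumed variant); returns sorted medoid indices."""
    rng = random.Random(seed)
    n = len(points)
    # Pairwise Euclidean distances between feature vectors (assumed metric).
    dist = [[math.dist(p, q) for q in points] for p in points]
    medoids = rng.sample(range(n), k)
    for _ in range(n_iter):
        # Assignment step: attach every point to its nearest medoid.
        # A medoid always stays in its own cluster, so no cluster is empty.
        clusters = {m: [] for m in medoids}
        for i in range(n):
            if i in clusters:
                nearest = i
            else:
                nearest = min(medoids, key=lambda m: dist[i][m])
            clusters[nearest].append(i)
        # Update step: in each cluster, pick the member with the smallest
        # total distance to the other members.
        new_medoids = [
            min(members, key=lambda c: sum(dist[c][j] for j in members))
            for members in clusters.values()
        ]
        if set(new_medoids) == set(medoids):
            break  # converged: medoids no longer change
        medoids = new_medoids
    return sorted(medoids)


# Hypothetical 2-D layout features for six documents (toy data only).
features = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15),
            (5.0, 5.1), (5.2, 4.9), (9.0, 0.5)]
selected = k_medoids(features, k=3)
print("Documents kept as cluster representatives:", selected)
```

Unlike K-Means centroids, medoids are always real data points, which is why the method suits sampling: the chosen indices map directly back to PDFs that can be kept in the corpus.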