Benchmarking PDF Parsers on Table Extraction with LLM-based Semantic Evaluation

arXiv:2603.18652v1

Reliably extracting tables from PDFs is essential for large-scale scientific data mining and knowledge base construction, yet existing evaluation approaches rely on rule-based metrics that fail to capture the semantic equivalence of table content. We present a benchmarking framework based on synthetically generated PDFs with precise LaTeX ground truth, using tables sourced from arXiv to ensure realistic complexity and diversity.
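
To illustrate the kind of failure the abstract attributes to rule-based metrics, here is a minimal sketch in Python. The `exact_cell_match` function and the toy tables are hypothetical and are not taken from the paper; they only show how a character-level comparison can penalize a parser whose output differs from the ground truth in number formatting alone.

```python
def exact_cell_match(truth, parsed):
    """Fraction of ground-truth cells reproduced character-for-character.

    Hypothetical rule-based metric, used here for illustration only.
    """
    total = sum(len(row) for row in truth)
    hits = 0
    for r, row in enumerate(truth):
        for c, cell in enumerate(row):
            # A cell counts only if the parser produced the identical string
            # at the same row and column position.
            if r < len(parsed) and c < len(parsed[r]) and parsed[r][c] == cell:
                hits += 1
    return hits / total if total else 1.0


# Toy ground truth and a parser output that differs only in number formatting.
truth = [["Model", "Accuracy (%)"], ["Baseline", "85.0"], ["Ours", "91.5"]]
parsed = [["Model", "Accuracy (%)"], ["Baseline", "85"], ["Ours", "91.50"]]

score = exact_cell_match(truth, parsed)
print(f"exact cell match: {score:.2f}")
```

Both tables carry the same information, yet the string comparison rejects the two numeric cells in the second column. A semantic evaluator would be expected to treat `85` and `85.0` as equivalent.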