RealCQA-V2: A Diagnostic Benchmark for Structured Visual Entailment over Scientific Charts

arXiv:2410.22492v3 Announce Type: replace Multimodal reasoning models often produce fluent answers accompanied by seemingly coherent rationales. Existing benchmarks evaluate only final-answer correctness. They do not support atomic visual entailment verification of intermediate steps, especially steps involving visual compositional logic. This limitation is especially acute in scientific chart understanding, where answers depend on deterministically grounded visual semantics such as axes, legends, and quantitative relations. We.