A2RBench: An Automatic Paradigm for Formally Verifiable Abstract Reasoning Benchmark Generation

arXiv:2605.17278v1 Announce Type: cross Abstract reasoning ability, the capacity to extract and apply abstract rules, reflects the intelligence and generalization of LLMs. However, accurately measuring this ability remains challenging: existing benchmarks either rely on expensive manual annotation, which limits their scale, or risk measuring memorization rather than genuine reasoning. To address this, we