GSM-SEM: Benchmark and Framework for Generating Semantically Variant Augmentations

arXiv:2605.07053v1 Announce Type: cross Benchmarks like GSM8K are popular measures of mathematical reasoning, but leaderboard gains can overstate true capability due to memorization of fixed test sets. Most robustness variants apply surface-level perturbations (paraphrases, renamings, number swaps, distractors) that largely preserve the underlying facts, and static releases can themselves become memorization targets over time. We