AI RESEARCH
SAGE: Scalable Automated Robustness Augmentation for LLM Knowledge Evaluation
arXiv CS.CL
arXiv:2605.12022v1 Announce Type: new Large Language Models (LLMs) achieve strong performance on standard knowledge evaluation benchmarks, yet recent work shows that their knowledge capabilities remain brittle under question variants that test the same knowledge in different forms. Robustness augmentation of existing knowledge evaluation benchmarks is, therefore, necessary, but current LLM-assisted generate-then-verify pipelines are costly and difficult to scale due to low-yield variant generation and unreliable variant verification.