AI RESEARCH
SPENCE: A Syntactic Probe for Detecting Contamination in NL2SQL Benchmarks
arXiv cs.CL
arXiv:2604.17771v1
Large language models (LLMs) have achieved strong performance on natural language to SQL (NL2SQL) benchmarks, yet their reported accuracy may be inflated by contamination from benchmark queries or structurally similar patterns seen during