Test of Time: Rethinking Temporal Signal of Benchmark Contamination

arXiv:2509.00072v3 Announce Type: replace Post-cutoff performance decay has been widely interpreted as a temporal signal of benchmark contamination. We critically examine this belief and demonstrate that this temporal signal is highly sensitive to how benchmark questions are constructed. Specifically, we show that LLM-generated questions can produce remarkably different temporal patterns from fill-in-the-blank questions retrieved directly from the very same materials.