Test-time RL alignment exposes task familiarity artifacts in LLM benchmarks

arXiv:2603.12875v1. Direct evaluation of LLMs on benchmarks can be misleading because comparatively strong performance may reflect task familiarity rather than capability. The train-before-test approach controls for task familiarity by giving each model task-relevant