Large Language Models Could Be Rote Learners

ArXi:2504.08300v5 Announce Type: replace-cross Benchmark-based evaluation, e.g., multiple-choice questions (MCQs) and open-ended questions (OEQs), is widely used for evaluating Large Language Models (LLMs), yet their reliability is undermined by benchmark contamination. When pre-exposed to the testing benchmark during