AI RESEARCH

ContractEval: A Benchmark for Evaluating Contract-Satisfying Assertions in Code Generation

arXiv CS.AI

arXiv:2510.12047v4 Announce Type: replace

Current code generation evaluation measures functional correctness on well-formed inputs that satisfy all input preconditions. This paradigm has a critical limitation: task descriptions often leave these preconditions implicit, while evaluation filters out inputs that violate them. As a result, generated code may achieve high pass scores while failing to enforce the preconditions that the task actually requires. To address this gap, we
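
The limitation described in the abstract can be illustrated with a minimal sketch. Everything here is hypothetical and chosen only for illustration: the task ("return the n-th even number, for n >= 1"), the function names, and the two helper checks are assumptions, not the benchmark's actual tasks, harness, or metrics.

```python
# Hypothetical task: "return the n-th even number", with the implicit
# precondition that n is a positive integer.

def nth_even_unchecked(n):
    # Correct for every well-formed input (n >= 1), but silently returns
    # a meaningless value when the precondition is violated.
    return 2 * n


def nth_even_checked(n):
    # Same logic, with the implicit precondition stated as an assertion.
    assert isinstance(n, int) and n >= 1, "n must be a positive integer"
    return 2 * n


# Illustrative inputs, not taken from any real test suite.
well_formed_inputs = [1, 2, 10]
ill_formed_inputs = [0, -3]


def passes_functional_tests(fn):
    # Conventional evaluation: only well-formed inputs are exercised.
    return all(fn(n) == 2 * n for n in well_formed_inputs)


def rejects_ill_formed(fn):
    # Contract-oriented check: every ill-formed input must be rejected.
    for n in ill_formed_inputs:
        try:
            fn(n)
        except AssertionError:
            continue
        return False
    return True
```

Under these assumptions, both functions pass the functional check, but only `nth_even_checked` rejects the ill-formed inputs, which is the kind of difference that evaluation restricted to well-formed inputs cannot observe.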