CodeSpecBench: Benchmarking LLMs for Executable Behavioral Specification Generation

arXiv:2604.12268v1 Announce Type: cross Large language models (LLMs) can generate code from natural language, but the extent to which they capture intended program behavior remains unclear. Executable behavioral specifications, defined via preconditions and postconditions, provide a concrete means to assess such understanding. However, existing work on specification generation is constrained in evaluation methodology, task settings, and specification expressiveness. We