Delulu: A Verified Multi-Lingual Benchmark for Code Hallucination Detection in Fill-in-the-Middle Tasks

ArXi:2605.07024v1 Announce Type: new Large Language Models for code generation frequently produce hallucinations in Fill-in-the-Middle (FIM) tasks -- plausible but incorrect completions such as invented API methods, invalid parameters, undefined variables, or non-existent imports. These failures pass superficial review yet