Self-Improving Code Generation via Semantic Entropy and Behavioral Consensus

arXiv:2603.29292v1 Announce Type: cross Improving the code generation capabilities of large language models (LLMs) typically relies on supervised fine-tuning or preference optimization, both of which require costly external resources such as powerful teacher models or reliable unit tests. In real-world scenarios, however, reference solutions and test oracles are much harder to obtain than problem descriptions and test inputs.