CTCal: Rethinking Text-to-Image Diffusion Models via Cross-Timestep Self-Calibration

arXiv:2603.20741v1 Announce Type: new Recent advances in text-to-image synthesis have been largely propelled by diffusion-based models, yet achieving precise alignment between text prompts and generated images remains a persistent challenge. We find that this difficulty arises primarily from the limitations of the conventional diffusion loss, which provides only implicit supervision for modeling fine-grained text-image correspondence. In this paper, we