MaskAttn-SDXL: Controllable Region-Level Text-To-Image Generation

arXiv:2509.15357v2 Announce Type: replace-cross

Diffusion models have achieved strong results in text-to-image generation, but important limitations remain as prompts become more structured and describe multiple objects. On the architecture side, U-Net backbones are efficient and stable, yet their locality makes global coordination harder, while Transformer-based diffusion models improve global interactions at substantially higher compute and memory cost.