UltraGen: Efficient Ultra-High-Resolution Image Generation with Hierarchical Local Attention

arXiv:2510.16325v2 Announce Type: replace
Ultra-high-resolution text-to-image generation is increasingly vital for applications requiring fine-grained textures and global structural fidelity, yet state-of-the-art text-to-image diffusion models such as FLUX and SD3 remain confined to sub-2MP (below $1K\times2K$) resolutions due to the quadratic complexity of attention mechanisms and the scarcity of high-quality high-resolution