GriDiT: Factorized Grid-Based Diffusion for Efficient Long Image Sequence Generation

arXiv:2512.21276v2

Modern deep learning methods typically treat image sequences as large tensors of sequentially stacked frames. However, is this straightforward representation ideal given the current state of the art (SoTA)? In this work, we address this question in the context of generative models and aim to devise an effective way of modeling image sequence data.