Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale

arXiv:2604.11331v1 Announce Type: new 3D scene generation has long been dominated by 2D multi-view or video diffusion models. This is due not only to the lack of a scene-level 3D latent representation, but also to the fact that most scene-level 3D visual data exists as multi-view images or videos, which are naturally compatible with 2D diffusion architectures. Typically, these 2D-based approaches reduce 3D spatial extrapolation to 2D temporal extension, which