A Single Image and Multimodality Is All You Need for Novel View Synthesis

arXiv:2602.17909v2 Announce Type: replace Diffusion-based approaches have recently demonstrated strong performance for single-image novel view synthesis by conditioning generative models on geometry inferred from monocular depth estimation. However, in practice, the quality and consistency of the synthesized views are fundamentally limited by the reliability of the underlying depth estimates, which are often fragile under low-texture, adverse-weather, and occlusion-heavy real-world conditions.