VolDiT: Controllable Volumetric Medical Image Synthesis with Diffusion Transformers

arXiv:2603.25181v1 Announce Type: new Diffusion models have become a leading approach for high-fidelity medical image synthesis. However, most existing methods for 3D medical image generation rely on convolutional U-Net backbones within latent diffusion frameworks. While effective, these architectures impose strong locality biases and limited receptive fields, which may constrain scalability, global context integration, and flexible conditioning. In this work, we