PoDAR: Power-Disentangled Audio Representation for Generative Modeling

arXiv:2605.10084v1 Announce Type: cross The performance of audio latent diffusion models is governed primarily by generator expressivity and the modelability of the underlying latent space. While recent research has focused mainly on the former, as well as on improving the reconstruction fidelity of audio codecs, we demonstrate that latent modelability can be significantly improved through explicit factor disentanglement.