Mitigating Latent Mismatch in cVAE-Based Singing Voice Synthesis via Flow Matching

arXiv:2601.00217v2

Singing voice synthesis (SVS) aims to generate natural and expressive singing waveforms from symbolic musical scores. In SVS systems based on a conditional variational autoencoder (cVAE), however, a mismatch arises between training and inference: the decoder is trained on latent representations inferred from the target singing signals, whereas at inference time it receives latent representations predicted from the conditioning inputs alone. This discrepancy can weaken fine expressive acoustic details in the synthesized output.
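The training/inference mismatch described above can be illustrated with a toy sketch. It assumes diagonal Gaussian posterior and prior distributions, which is the usual cVAE setup but is not confirmed by the abstract. All names (`gaussian_kl`, `mu_post`, `mu_prior`, and so on) and all numeric values are hypothetical stand-ins, not the paper's model or its flow-matching method. The sketch only shows that the decoder sees latents from different distributions in the two phases, and uses the KL divergence as one possible measure of that gap.

```python
import numpy as np


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians, summed over latent dimensions."""
    var_q = np.exp(logvar_q)
    var_p = np.exp(logvar_p)
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.sum(axis=-1)


rng = np.random.default_rng(0)
latent_dim = 4

# Hypothetical posterior q(z | audio, score): sees the target singing signal,
# so it can be sharp (small variance) and shifted away from the prior.
mu_post = rng.normal(size=latent_dim)
logvar_post = np.full(latent_dim, -2.0)

# Hypothetical prior p(z | score): predicted from the conditioning input only.
mu_prior = np.zeros(latent_dim)
logvar_prior = np.zeros(latent_dim)

# Training: the decoder is fed a posterior sample (reparameterization trick).
z_train = mu_post + np.exp(0.5 * logvar_post) * rng.normal(size=latent_dim)

# Inference: no target audio exists, so the decoder is fed a prior sample.
z_infer = mu_prior + np.exp(0.5 * logvar_prior) * rng.normal(size=latent_dim)

# One way to quantify the gap between the two latent distributions.
mismatch = gaussian_kl(mu_post, logvar_post, mu_prior, logvar_prior)
print("KL(posterior || prior):", float(mismatch))
```

A KL of zero would mean the decoder sees the same latent distribution in both phases. Any positive value means the decoder is evaluated on latents it was not trained on, which is the mismatch the abstract refers to.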