OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation

arXiv:2605.12480v1 Announce Type: cross Recent advances in joint audio-video generation have been remarkable, yet real-world applications demand strong per-modality fidelity, cross-modal alignment, and fine-grained synchronization. Reinforcement learning (RL) offers a promising paradigm, but its extension to multi-objective and multi-modal joint audio-video generation remains unexplored.