VibeVoice-ASR Technical Report

arXiv:2601.18184v2

This report presents VibeVoice-ASR, a general-purpose speech understanding framework built upon VibeVoice. It is designed to address two challenges that persist in long-form audio (e.g., meetings, podcasts) despite recent advances in short-form speech recognition: context fragmentation and multi-speaker complexity. Unlike traditional pipelined approaches that rely on audio chunking, VibeVoice-ASR supports single-pass processing of up to 60 minutes of audio.