I benchmarked 31 STT models on medical audio — VibeVoice 9B is the new open-source leader at 8.34% WER, but it's big and slow

r/LocalLLaMA
Generative AI NLP AI Research

TL;DR: v3 of my medical speech-to-text benchmark. 31 models now (up from 26 in v2). Microsoft VibeVoice-ASR 9B takes the open-source crown at 8.34% WER, nearly matching Gemini 2.5 Pro (8.15%). But it's 9B params, needs ~18GB VRAM (ran it on an H100 since I had easy access, but an L4 or similar would work too), and even on H100 it's slow - 97s per file vs 6s for Parakeet. Also found bugs in Whisper's text normalizer that were inflating WER by 2-3% across every model.