vibevoice.cpp: Microsoft VibeVoice (TTS + long-form ASR with diarization) ported to ggml/C++, runs on CPU/CUDA/Metal/Vulkan, no Python at inference

r/LocalLLaMA

A few weeks ago I shipped vibevoice.cpp, a pure-C++ ggml port of Microsoft VibeVoice (the speech-to-speech model with voice cloning). I wanted to post a follow-up here because the engine has grown well past a "first-pass port" and into something other people might actually want to run. This work was brought to you with <3 from the LocalAI team!

What it does:

TTS with pre-converted voice prompts (any of upstream's .pt voices, ours, or yours converted via scripts/convert_voice_to_gguf.py): give it a 30s reference clip and it generates 24kHz speech in the cloned voice.