Mistral Introduces "Voxtral TTS": An Open-Weight Text-to-Voice Model Capable Of Cloning Any Voice From 3 Seconds Of Audio, Runs In 9 Languages, & Beats Elevenlabs Flash V2.5 With A 68.4% Human Preference Win Rate.

r/LocalLLaMA
Machine Learning Generative AI Open Source AI AI Research AI Tools

ElevenLabs built a moat on proprietary weights and API lock-in. Mistral just put the weights on Hugging Face. The model captures not just the voice but the person. Accents, inflections, intonations, vocal fillers the "ums" and "ahs" that make a voice sound human instead of synthetic. From 3 seconds of reference audio. Zero fine-tuning. Zero shot...