FishSpeech S2 Pro streaming code (380ms TTFA, tested on RTX 5090)

r/LocalLLaMA

So, uh, yes: I did a lot of debugging and learning, and I'm your average web dev, not an ML engineer, so my apologies for the cursed code 🤣 Streaming should work end-to-end with low TTFA (time to first audio): ~400ms until the first audio chunk on Arch Linux, RTX 5090, NVIDIA driver 595.45.04, 9950X3D. There's still work to do on memory, TTFA, and longer prompts.

Here are some ideas: Figure out how to use torch.compile properly; right now it recompiles after warmup during the smoke e2e test, and every recompile takes about 6 minutes. Stream tokens into the vocoder on a schedule (per lengyue) instead of as one big chunk.
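To make the "stream tokens on a schedule" idea a bit more concrete, here's a rough sketch in plain Python. It is not the FishSpeech code and not lengyue's actual schedule: the `scheduled_chunks` helper, the `fake_token_stream` stand-in, and the `(8, 16, 32)` chunk sizes are all made up for illustration. The idea is just a small first chunk so audio starts early, then bigger chunks so the vocoder gets called less often. It also ignores any overlap or crossfade a real vocoder would probably need between chunks.

```python
from typing import Iterable, Iterator, List, Sequence


def scheduled_chunks(
    tokens: Iterable[int],
    schedule: Sequence[int] = (8, 16, 32),  # hypothetical sizes, tune for your setup
) -> Iterator[List[int]]:
    """Group a token stream into chunks whose sizes follow `schedule`.

    The last entry of `schedule` is reused once the schedule runs out.
    """
    if not schedule or any(size <= 0 for size in schedule):
        raise ValueError("schedule must contain positive chunk sizes")

    step = 0
    buffer: List[int] = []
    for token in tokens:
        buffer.append(token)
        # Pick the target size for the current chunk; clamp to the last entry.
        target = schedule[min(step, len(schedule) - 1)]
        if len(buffer) >= target:
            yield buffer
            buffer = []
            step += 1
    if buffer:
        yield buffer  # flush the tail so no tokens are dropped


def fake_token_stream(n: int) -> Iterator[int]:
    """Stand-in for the model's streaming token generator (hypothetical)."""
    yield from range(n)


if __name__ == "__main__":
    for chunk in scheduled_chunks(fake_token_stream(100)):
        # A real pipeline would hand `chunk` to the vocoder here.
        print(len(chunk))
```

In a real pipeline the loop body would pass each chunk to the vocoder and push the resulting audio to the client as soon as it's ready, so the first chunk size is the main knob for TTFA.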