Seeking the Absolute Lowest Latency for Qwen 3.5 9B: Best Inference Engine for 1-Stream Real-Time TTS?
r/LocalLLaMA
Hi everyone, I'm building a real-time voice chat pipeline (STT -> LLM -> TTS) and I’m hitting a bottleneck in the "Time to Sentence" part. My goal is to minimize the total latency for generating a 100-token response.

My Requirements:

* Model: Qwen 3.5 9B (currently testing FP16 and EXL3 quants).
* Hardware: 1x NVIDIA RTX 3090 Ti.
* Metric: Lowest possible TTFT (Time To First Token) + Highest TPS (Tokens Per Second) for a single stream (Batch Size 1).
* Target: Total time for ~100 tokens should be as close to 500-700ms as possible or lower.
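To make the target concrete, here is a rough back-of-envelope sketch of the budget, not a benchmark. It assumes total latency is about TTFT plus the remaining tokens divided by TPS, and that batch-size-1 decoding is memory-bandwidth bound, so every generated token has to read all the weights once. The numbers are assumptions on my side: a nominal 9e9 parameters, about 1008 GB/s of memory bandwidth for the 3090 Ti, 2 bytes per weight for FP16, 0.5 bytes per weight as a stand-in for a ~4 bpw quant, and a placeholder TTFT of 50 ms. The function names are made up for illustration.

```python
"""Back-of-envelope latency budget for single-stream (batch size 1) decoding."""


def required_tps(n_tokens: int, budget_s: float, ttft_s: float) -> float:
    """Decode rate needed to emit n_tokens within budget_s, given a TTFT."""
    decode_budget_s = budget_s - ttft_s
    if decode_budget_s <= 0:
        raise ValueError("TTFT alone exceeds the latency budget")
    # The first token arrives at TTFT; the other n_tokens - 1 are decoded after it.
    return (n_tokens - 1) / decode_budget_s


def bandwidth_bound_tps(n_params: float, bytes_per_param: float,
                        mem_bandwidth_bytes_s: float) -> float:
    """Rough TPS ceiling if each token requires reading every weight once.

    Ignores KV-cache reads, activations, and kernel launch overhead,
    so real throughput will be lower than this.
    """
    return mem_bandwidth_bytes_s / (n_params * bytes_per_param)


def total_latency_s(n_tokens: int, ttft_s: float, tps: float) -> float:
    """Total time to the last token: TTFT plus steady-state decoding."""
    return ttft_s + (n_tokens - 1) / tps


if __name__ == "__main__":
    N_PARAMS = 9e9          # assumption: nominal parameter count
    BANDWIDTH = 1008e9      # assumption: ~1008 GB/s for an RTX 3090 Ti
    TTFT_S = 0.05           # assumption: placeholder, measure your own
    N_TOKENS = 100

    for budget in (0.5, 0.7):
        need = required_tps(N_TOKENS, budget, TTFT_S)
        print(f"budget {budget * 1000:.0f} ms -> need about {need:.0f} tok/s")

    # Bytes per weight are approximations; real quant formats add overhead.
    for label, bytes_per_param in (("FP16", 2.0), ("~4 bpw quant", 0.5)):
        ceiling = bandwidth_bound_tps(N_PARAMS, bytes_per_param, BANDWIDTH)
        best = total_latency_s(N_TOKENS, TTFT_S, ceiling)
        print(f"{label}: ceiling about {ceiling:.0f} tok/s, "
              f"best case about {best * 1000:.0f} ms for {N_TOKENS} tokens")
```

If those assumptions are roughly right, FP16 tops out somewhere around 56 tok/s on this card, which puts 100 tokens at roughly 1.8 s before any overhead. A ~4 bpw quant has a ceiling near 224 tok/s, so the 500-700ms target is at least arithmetically reachable there, but only with a very efficient engine.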