Boosting DeepSeek-R1’s Speed with Customized Speculative Decoding
Together AI Blog
TLDR: In this blog post, we show that a custom speculator, trained on your own DeepSeek-R1 inference traffic, can yield 1.23-1.45x decoding speedups (tokens/second) and a ~25% reduction in overall cost (the same throughput with fewer GPU-hours) relative to Together’s state-of-the-art base speculator. Compared with conventional next-token prediction, this translates to 1.85-2.97x speedups and ~55% cost reductions. Please reach out to our sales team to learn how to get started with a custom speculator.
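To see how the two cost figures relate, here is a small back-of-the-envelope sketch. It assumes both percentages are measured on the same cost metric (GPU-hours at equal throughput) and that the rounded "~" values are exact, which the post does not state; the function name is purely illustrative.

```python
# Rough consistency check on the cost figures quoted in the TLDR.
# Assumption: both reductions use the same cost metric and workload.

def implied_base_cost_cut(cut_vs_base: float, cut_vs_conventional: float) -> float:
    """Cost reduction of the base speculator relative to conventional
    next-token prediction, implied by the two quoted reductions."""
    custom_over_base = 1.0 - cut_vs_base                   # custom cost / base cost
    custom_over_conventional = 1.0 - cut_vs_conventional   # custom cost / conventional cost
    base_over_conventional = custom_over_conventional / custom_over_base
    return 1.0 - base_over_conventional


if __name__ == "__main__":
    # ~25% vs. the base speculator, ~55% vs. conventional decoding
    print(f"{implied_base_cost_cut(0.25, 0.55):.0%}")
```

Under these assumptions, the base speculator alone would account for roughly a 40% cost reduction over conventional decoding, with the custom speculator providing the remainder.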