Cache-aware prefill–decode disaggregation (CPD) for up to 40% faster long-context LLM serving

Together AI Blog

Serving long prompts doesn't have to mean slow responses. Learn how Together AI's cache-aware prefill–decode disaggregation (CPD) architecture separates warm and cold inference workloads to deliver up to 40% higher throughput and substantially lower time-to-first-token (TTFT) for long-context LLM serving.