IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse

This repository provides a patch for SGLang and vLLM that enables IndexCache inference acceleration for models using DeepSeek Sparse Attention (DSA), including DeepSeek-V3.2 and GLM-5.

TL;DR: IndexCache eliminates up to 75% of indexer computations in DSA through cross-layer index reuse, achieving up to 1.82× prefill speedup and 1.48× decode speedup with negligible quality degradation. One if/else branch, zero extra GPU memory.
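To make the idea of cross-layer index reuse concrete, here is a minimal sketch in plain Python. It is not the code from the patch: the names `run_layers`, `toy_indexer`, and `full_every` are hypothetical, and the fixed schedule (one indexer call every `full_every` layers) is an assumption made for illustration, since the actual choice of which layers recompute indices is not described here.

```python
from typing import Callable, List, Optional, Tuple


def run_layers(
    num_layers: int,
    full_every: int,
    compute_indices: Callable[[int], List[int]],
) -> Tuple[List[List[int]], int]:
    """Return the token indices used by each layer and the indexer call count.

    Assumption: a layer runs the indexer when its position is a multiple of
    `full_every`; every other layer reuses the most recent result.
    """
    cached: Optional[List[int]] = None
    per_layer: List[List[int]] = []
    indexer_calls = 0
    for layer in range(num_layers):
        if cached is None or layer % full_every == 0:
            # "Full" layer: run the indexer and remember its selection.
            cached = compute_indices(layer)
            indexer_calls += 1
        else:
            # "Shared" layer: skip the indexer and reuse the cached selection.
            pass
        per_layer.append(cached)
    return per_layer, indexer_calls


def toy_indexer(layer: int) -> List[int]:
    """Hypothetical stand-in for a DSA indexer: pick the top-3 of 8 positions."""
    scores = [((layer + 1) * (pos + 3)) % 7 for pos in range(8)]
    top = sorted(range(len(scores)), key=lambda p: scores[p], reverse=True)[:3]
    return sorted(top)


indices, calls = run_layers(num_layers=8, full_every=4, compute_indices=toy_indexer)
skipped_fraction = 1 - calls / 8
print(calls, skipped_fraction)
```

With `full_every=4`, only layers 0 and 4 run the indexer, so three out of every four indexer calls are skipped. The reused list is the same object that the full layer produced, so in this sketch reuse stores nothing beyond that one selection.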