The exact KV cache usage of DeepSeek V4

r/LocalLLaMA

Figure 1 of the DSV4 paper seems to imply that DSV3.2 uses ~50GB of KV cache at 1m context and DSV4 uses ~5GB.

From my own calculations, the correct FP16 KV cache sizes should be:

| Model | Params | 128k | 160k | 1m | KV% |
|---|---|---|---|---|---|
| V3.x | 671B | 8.58GiB | 10.72GiB | 68.63GiB | 5.11% |
| V4 Flash | 284B | 0.76GiB | 0.95GiB | 6.08GiB | 1.07% |
| V4 Pro | 1600B | 1.09GiB | 1.36GiB | 8.71GiB | 0.272% |

So the KV cache saving is not 9.5x but 7.879x (68.63GiB for V3.x vs 8.71GiB for V4 Pro at 1m context), which is still very impressive. If you look at the KV% metric, we are seeing a gain of close to 20x. This basically obliterates the KV cache usage of all current transformer-SSM hybrid models.
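As a sanity check on the numbers above, here is a minimal Python sketch that rebuilds the V3.x row and the two ratios. The V3 config values (61 layers, a 512-dim compressed KV latent plus a 64-dim RoPE key per token per layer) are my assumption from the published V3-family config, not something stated in this post, and `kv_cache_gib` is just an illustrative helper. I don't have the equivalent V4 values, so the V4 figures are taken straight from the table. The KV% column appears to be the 1m KV size divided by 2 bytes per parameter, which is also inferred from the numbers rather than stated.

```python
GIB = 2**30

def kv_cache_gib(tokens, layers, latent_dim, rope_dim, bytes_per_elem=2):
    """KV cache size in GiB for an MLA-style cache that stores one
    compressed latent plus one RoPE key per token per layer."""
    bytes_per_token = (latent_dim + rope_dim) * layers * bytes_per_elem
    return tokens * bytes_per_token / GIB

# ASSUMPTION: V3-family config values, not given in the post.
V3 = dict(layers=61, latent_dim=512, rope_dim=64)

# Context lengths, reading "k" as 1024 tokens and "1m" as 1024k.
CONTEXTS = {"128k": 128 * 1024, "160k": 160 * 1024, "1m": 1024 * 1024}

v3_sizes = {name: kv_cache_gib(n, **V3) for name, n in CONTEXTS.items()}

# Figures copied from the table (KV cache at 1m in GiB, params in billions).
TABLE_1M_GIB = {"V3.x": 68.63, "V4 Flash": 6.08, "V4 Pro": 8.71}
PARAMS_B = {"V3.x": 671, "V4 Flash": 284, "V4 Pro": 1600}

# Absolute saving: V3.x vs V4 Pro at 1m context.
saving = TABLE_1M_GIB["V3.x"] / TABLE_1M_GIB["V4 Pro"]

# KV% as it appears to be defined: the 1m KV figure over 2 bytes/param.
kv_pct = {m: 100 * TABLE_1M_GIB[m] / (2 * PARAMS_B[m]) for m in TABLE_1M_GIB}
kv_pct_gain = kv_pct["V3.x"] / kv_pct["V4 Pro"]

for name, size in v3_sizes.items():
    print(f"V3.x @ {name}: {size:.3f} GiB")
print(f"saving: {saving:.3f}x, KV% gain: {kv_pct_gain:.2f}x")
```

Under those assumptions the V3.x sizes land within rounding of the table's row, so the table looks consistent with a plain per-token, per-layer latent cache that scales linearly with context length.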