KV cache taking too much memory. Any solutions (optimizations, compression, etc.) coming soon or later?

r/LocalLLaMA

I didn't see any recent threads on this topic, so I'm posting this. As the title says, the KV cache takes too much memory, sometimes even more than the model's own size at long context (check the images for an example). In recent months we've been getting models with up to 256K native context, which can then be extended to 1M using YaRN. Recent models like Qwen3-Next and the Qwen3.5 series hold up better at longer context without losing much speed compared to other models. For the models themselves, at least we have pruning.
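
For anyone who wants to sanity-check the numbers, here is a rough back-of-the-envelope sketch of how KV cache size scales with context. It assumes a plain transformer where every layer keeps full keys and values (standard or grouped-query attention). The layer count, KV head count, and head dimension below are made-up illustrative values, not the config of any particular model, and models that mix in linear-attention or sliding-window layers should need less than this estimate.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                   bytes_per_elem=2, batch_size=1):
    """Estimate KV cache size in bytes for a full-attention transformer.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem: 2 for fp16/bf16, 1 for an 8-bit cache.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * context_len * bytes_per_elem * batch_size)


def to_gib(n_bytes):
    """Convert bytes to GiB."""
    return n_bytes / 2**30


# Hypothetical config (assumption, not a real model):
# 32 layers, 8 KV heads, head_dim 128.
for ctx in (32 * 1024, 256 * 1024, 1024 * 1024):
    fp16 = to_gib(kv_cache_bytes(32, 8, 128, ctx, bytes_per_elem=2))
    int8 = to_gib(kv_cache_bytes(32, 8, 128, ctx, bytes_per_elem=1))
    print(f"{ctx:>8} tokens: {fp16:6.1f} GiB fp16, {int8:6.1f} GiB 8-bit")
```

With these assumed values the fp16 cache comes out to 32 GiB at 256K tokens, and an 8-bit cache halves that. The size grows linearly with context length, which is why the cache can end up larger than the weights at long context.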