Memory Sparse Attention seems to be a novel approach to long context (up to 100M tokens)

r/LocalLLaMA

Really interesting approach to solving long-context rot. Basically, a hyper-efficient index of the KV cache is stored in the GPU's VRAM and points to a compressed KV cache stored in system RAM. It requires