Decoupled Attention from Weights - Gemma 4 26B

r/LocalLLaMA

Unbelievably exciting work: it splits attention (i.e. a couple of GB) onto one local machine and the weights onto another local machine (say, a cheap Xeon), which basically bypasses the scale issue with local LLMs, since no single machine has to hold the whole model. Repo with functional code: Edit: just found for excellent overview of what's happening here. submitted by /u/yeah-ok
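To make the split concrete, here is a toy sketch of one decode step, not the repo's actual code. It assumes that "attention" means the key/value cache plus the softmax step, and that every weight-bearing projection lives on the other machine. `WeightsHost`, `AttentionHost`, and `decode_step` are hypothetical names, the model is a single head and a single layer, and both "machines" are plain objects in one process, so the network hop is only marked in comments.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


class WeightsHost:
    """Hypothetical machine that holds the projection weights (the bulk of the model)."""

    def __init__(self, Wq, Wk, Wv, Wo):
        self.Wq, self.Wk, self.Wv, self.Wo = Wq, Wk, Wv, Wo

    def project_qkv(self, x):
        # Weight-heavy work: turn the hidden state into query, key, value.
        return matvec(self.Wq, x), matvec(self.Wk, x), matvec(self.Wv, x)

    def project_out(self, context):
        # Weight-heavy work: output projection after attention.
        return matvec(self.Wo, context)


class AttentionHost:
    """Hypothetical machine that holds only the key/value cache, no weights."""

    def __init__(self):
        self.keys = []
        self.values = []

    def attend(self, q, k, v):
        # Append the new token's key/value, then attend over the whole cache.
        self.keys.append(k)
        self.values.append(v)
        scale = 1.0 / math.sqrt(len(q))
        scores = [scale * sum(a * b for a, b in zip(q, key)) for key in self.keys]
        top = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        probs = [e / total for e in exps]
        return [
            sum(p * val[i] for p, val in zip(probs, self.values))
            for i in range(len(v))
        ]


def decode_step(weights_host, attention_host, x):
    """One token step; each call across hosts would be a network round trip."""
    q, k, v = weights_host.project_qkv(x)      # on the weights machine
    context = attention_host.attend(q, k, v)   # on the attention machine
    return weights_host.project_out(context)   # back on the weights machine
```

In this sketch only small per-token vectors (`q`, `k`, `v`, `context`) cross between the two hosts, while the weight matrices and the cache each stay put. Whether the real project moves exactly these tensors is an assumption here.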