Running 1 trillion parameter LLMs locally at 5 tokens/second - Intel Optane Persistent Memory build

r/LocalLLaMA

Hello! I thought r/LocalLLaMA would be interested in my recent build. As the title states, it can run a 1 trillion parameter model (in this case Kimi K2.5) locally at ~5 tokens/second. The main goal was to avoid relying on many expensive GPUs or large amounts of DRAM while still achieving (relatively) fast speeds, and I was able to meet that goal with some carefully selected parts. The key enabler is Intel Optane Persistent Memory, a discontinued product that many people likely haven’t come across in the context of local inference.
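
For a rough sense of why memory capacity is the bottleneck at this scale, here is a back-of-envelope sketch of the space the weights alone would need. The bit widths are illustrative assumptions, not the quantization used in this build, and the function name is made up for the example. It also ignores the KV cache and runtime overhead.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 1e12  # "1 trillion parameters", as stated in the title

# Assumed bit widths, for illustration only (not taken from the build).
for bits in (16, 8, 4):
    size_gb = weight_footprint_gb(NUM_PARAMS, bits)
    print(f"{bits}-bit weights: ~{size_gb:,.0f} GB")
```

Even at 4 bits per weight, the weights come to roughly half a terabyte, which is why holding them entirely in GPU memory or DRAM gets expensive quickly.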