What solutions are you using to boost TPS and Context Window?

r/LocalLLaMA

Server Specs:
16 GB DDR5
AMD Ryzen 5 7600X 4.7 GHz 6-Core Processor
AMD Radeon Sapphire Nitro+ 7900XTX
NZXT N7 B650E ATX AM5 Motherboard

Performance: I'm running Qwen27b Q4 at 80k context on a Sapphire Nitro+ Radeon 7900XTX 24 GB at 40 t/s. My setup is llama.cpp + Vulkan.

Question: I've been having a blast with it, but it's time for some extra power under the hood. Generation is just slow enough to be annoying with tool calls, and the context window is just too short to handle even the smaller "big" tasks. In a perfect world I'd be running 120-140k context at 60 t/s.
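For anyone sizing this up, here is a rough back-of-envelope sketch of how KV cache memory grows with context, since that is usually what caps the window on a 24 GB card. It assumes a standard transformer with grouped-query attention, where the cache holds one key and one value tensor per layer. The layer count, KV head count, and head dimension below are hypothetical placeholders, not the real values for my model; read the actual ones from the GGUF metadata. The 1.0625 bytes per element figure is the q8_0 block layout (32 int8 values plus one 16-bit scale).

```python
def kv_cache_gib(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2.0):
    """Approximate KV cache size in GiB for a standard GQA transformer."""
    # Factor of 2: one key tensor and one value tensor per layer.
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem
    return total_bytes / 1024**3


# HYPOTHETICAL architecture values for illustration only.
# Replace them with the numbers from your model's metadata.
N_LAYERS = 48
N_KV_HEADS = 4
HEAD_DIM = 128

F16_BYTES = 2.0      # 16-bit cache entries
Q8_0_BYTES = 34 / 32  # q8_0: 32 int8 values + one 16-bit scale per block

if __name__ == "__main__":
    for n_ctx in (80_000, 120_000, 140_000):
        f16 = kv_cache_gib(n_ctx, N_LAYERS, N_KV_HEADS, HEAD_DIM, F16_BYTES)
        q8 = kv_cache_gib(n_ctx, N_LAYERS, N_KV_HEADS, HEAD_DIM, Q8_0_BYTES)
        print(f"{n_ctx:>7} tokens: f16 ~{f16:.2f} GiB, q8_0 ~{q8:.2f} GiB")
```

The estimate is linear in context length, so going from 80k to 140k costs about 1.75x the cache memory at the same cache precision. It ignores the model weights, compute buffers, and any layers that do not use a conventional KV cache, so treat it as a rough guide at best.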