What a time to be alive from 1tk/sec to 20-100tk/sec for huge models

r/LocalLLaMA
AI Research

Llama405b q4 at 1.2tk/sec 2 years ago was something to be excited about. That same hardware will now run HUGE state of the art models (kimik2.6, deepseekv4flash, minimax2.7, step3.5flash, qwen3.5-397b) at 30tk-100tk/sec while crushing llama405b.:-/ I recall folks asking why anyone would want to run Llama405b at 1.2/tk, etc. My answer when folks asked me was that I wanted to be ready for when AGI arrived. If it meant being able to run my own super AI at 1tk/sec I wanted that option. It turned out better than I could have ever imagined, we do have super AGI and we can run them cheap and fast.