From FlashLM to State Flow Machine: stopped optimizing transformers, started replacing them. First result: 79% length retention vs transformers' 2%
r/LocalLLaMA
Some of you might remember my FlashLM series. I was the student building ternary language models on free tier CPUs. v6 "SUPERNOVA" hit 3500 tok/s with a P-RCSM architecture, no attention, no convolution. Got a lot of great feedback and some deserved criticism about scaling.

Why I moved on from FlashLM

After v6 I spent several days working on v7. The plan was to scale P-RCSM to 10M+ params with a proper dataset and validate whether the reasoning components actually helped. What I found instead was a ceiling, and it wasn't where I expected.