[R] Hybrid attention for small code models: 50x faster inference, but data scaling still dominates

TLDR: Forked pytorch and triton internals. Changed attention so its linear first layer, middle quadratic layer, last linear layer Inference got much faster with a low perplexity hit in tests. I trained a 25.6M parameter Rust-focused language model from scratch using a byte-level GPT-style decoder. The main result is that increasing dataset size mattered than any architectural change. Expanding the corpus from about 31MB of core Rust sources to roughly 173MB by adding a few hundred crates produced a much larger improvement than anything else.