Rewriting model inference with CUDA kernels: the bottleneck was not just GEMM [P]

I’ve been working on a CUDA-first inference runtime for small-batch / realtime ML workloads. The core idea is simple: instead of treating PyTorch, TensorRT, or a generic graph runtime as the main execution path, I rewrite the model's inference path directly in C++/CUDA kernels. This started with robotics / VLA (vision-language-action) workloads, but the problem is general. In small-batch inference, the bottleneck is often not just a single slow