Writing an LLM compiler from scratch: PyTorch to CUDA in 5,000 lines of Python

Hey r/LocalLLaMA, I wanted to put together a simple overview of the modern ML compiler stack: essentially, what happens between calling model.generate() and the GPU executing a kernel. The problem is that the stack is brutal to read. TVM is 500K+ lines of C++. PyTorch piles Dynamo, Inductor, and Triton on top of each other. Then there's XLA, MLIR, Halide, and Mojo. So I took a different approach and built a compiler from scratch, in just pure Python and raw CUDA. The goal: take a small model (Qwen2.5-7B, TinyLlama) and compile it into a sequence of CUDA kernels.