An overview of the modern LLM compiler stack: writing an interactive and hackable compiler

Hey r/LocalLLaMA, the production ML compiler stack is brutal: TVM is 500K+ lines of C++. PyTorch piles Dynamo, Inductor, and Triton on top of each other. Then there are XLA, MLIR, Halide, and Mojo. It is arguably the most important piece of modern compute infrastructure, yet there is little accessible information on its core concepts. To fill the gap, I built a small ML compiler from scratch: pure Python and raw CUDA, no external libraries. It takes a small transformer (TinyLlama, Qwen2.5-7B) and lowers it to a sequence of CUDA kernels through six IRs. On an RTX 5090, the autotuned stack lands at a geomean of 0.96× vs.