Warps, Memory Hierarchy, and Why Bandwidth Beats FLOPS: How GPUs Actually Work, Part 1

A working mental model of GPU hardware for ML engineers who use these chips daily but have never traced what happens below the CUDA API.

Generating a single token from a 70B-parameter model on an H100 requires reading roughly 140 GB of weights from memory (at 2 bytes per parameter) and performing about 140B arithmetic operations on them. That works out to one operation per byte loaded. Memory bandwidth, not compute throughput, determines how fast that token comes out. By the end of this series, the reasoning that leads from the hardware specs to this conclusion should be clear.
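The arithmetic behind the one-operation-per-byte figure can be sketched in a few lines of Python. This is a back-of-envelope estimate, not a benchmark. The 70B parameter count comes from the text; the 2 bytes per parameter (16-bit weights) and 2 operations per parameter (one multiply, one add) are assumptions chosen to reproduce the 140 GB and 140B figures. The peak bandwidth and compute numbers are approximate published values for an H100 SXM and should be checked against the datasheet for the exact part in use. The function name `decode_estimate` is hypothetical.

```python
# Back-of-envelope estimate for generating one token (batch size 1).
PARAMS = 70e9           # model parameters (from the text)
BYTES_PER_PARAM = 2     # assumption: 16-bit weights (FP16/BF16)
OPS_PER_PARAM = 2       # assumption: one multiply + one add per weight

# Assumed approximate peak figures for an H100 SXM; verify before relying on them.
PEAK_BANDWIDTH = 3.35e12   # bytes per second of memory bandwidth
PEAK_COMPUTE = 989e12      # dense 16-bit operations per second


def decode_estimate(params=PARAMS,
                    bytes_per_param=BYTES_PER_PARAM,
                    ops_per_param=OPS_PER_PARAM,
                    bandwidth=PEAK_BANDWIDTH,
                    compute=PEAK_COMPUTE):
    """Return traffic, work, and ideal time bounds for one decoded token."""
    bytes_read = params * bytes_per_param   # every weight is read once
    ops = params * ops_per_param            # every weight is used once
    return {
        "bytes_read": bytes_read,
        "ops": ops,
        "ops_per_byte": ops / bytes_read,        # arithmetic intensity
        "memory_time_s": bytes_read / bandwidth,  # time if only memory limits
        "compute_time_s": ops / compute,          # time if only compute limits
        "balance_ops_per_byte": compute / bandwidth,  # intensity needed to keep the ALUs busy
    }


if __name__ == "__main__":
    est = decode_estimate()
    for name, value in est.items():
        print(f"{name}: {value:.4g}")
```

Under these assumptions the memory-limited time is roughly 42 ms and the compute-limited time roughly 0.14 ms, so the token cannot arrive faster than the weights can be read. The chip would need on the order of 300 operations per byte loaded to be compute-bound, and this workload supplies one.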