[P] Built GPT-2, Llama 3, and DeepSeek from scratch in PyTorch - open source code + book

I wrote a book that implements modern LLM architectures from scratch. The part most relevant to this sub: Chapter 3 takes GPT-2 and swaps exactly 4 things to get Llama 3.2-3B: LayerNorm → RMSNorm Learned positional encodings → RoPE GELU → SwiGLU Multi-Head Attention → Grouped-Query Attention Then loads Meta's real pretrained weights. Chapter 5 builds DeepSeek's full architecture: MLA with the absorption trick, decoupled RoPE, MoE with shared experts and fine-grained segmentation, auxiliary-loss-free load balancing, Multi-Token Prediction, and FP8 quantisation. All code is open.