I trained the same GPT architecture twice — CPU vs GPU, 0.82M vs 10.82M params, full logs inside

Built a character-level GPT from scratch in PyTorch - no pre-trained weights, no HuggingFace, no shortcuts. Trained the same architecture twice under very different compute conditions to measure exactly what scaling does to loss and output quality. Repo: --- **Architecture (both runs)** Standard GPT decoder stack - multi-head causal self-attention, learned positional embeddings, LayerNorm + residuals, AdamW (lr=3e-4), dropout=0.2. Only the scale differs between runs.