[R] Depth-first pruning transfers: GPT-2 → TinyLlama with stable gains and minimal loss

TL;DR: Removing the right layers (instead of shrinking every layer) makes transformer models ~8-12% smaller with only ~6-8% quality loss, and this now holds across two architectures (GPT-2 and TinyLlama) with near-zero variance across seeds.

I’ve been experimenting with depth-first pruning: removing entire layers based on their sensitivity, rather than shrinking model width. I started on GPT-2… and have just validated it on TinyLlama 1.1B with a full 3-seed replication.
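To make "removing entire layers based on sensitivity" concrete, here is a minimal toy sketch in plain Python. It is not the actual pipeline used for these results: the scoring rule (loss increase when a single block is skipped), the scalar "model", and every name in it (`forward`, `layer_sensitivity`, `prune_least_sensitive`, `toy_layers`) are assumptions made for illustration only.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple

# Hypothetical stand-in for a transformer block: a function on a scalar "hidden state".
Layer = Callable[[float], float]

def forward(layers: Sequence[Layer], x: float, skip: FrozenSet[int] = frozenset()) -> float:
    """Run residual blocks in order; a skipped block acts as the identity."""
    for i, layer in enumerate(layers):
        if i not in skip:
            x = x + layer(x)  # residual connection
    return x

def mean_loss(layers: Sequence[Layer], data: Sequence[Tuple[float, float]],
              skip: FrozenSet[int] = frozenset()) -> float:
    """Mean squared error over (input, target) pairs."""
    return sum((forward(layers, x, skip) - y) ** 2 for x, y in data) / len(data)

def layer_sensitivity(layers: Sequence[Layer], data: Sequence[Tuple[float, float]]) -> List[float]:
    """Assumed criterion: loss increase when exactly one block is skipped."""
    base = mean_loss(layers, data)
    return [mean_loss(layers, data, frozenset({i})) - base for i in range(len(layers))]

def prune_least_sensitive(layers: Sequence[Layer], data: Sequence[Tuple[float, float]],
                          n_remove: int) -> List[int]:
    """Return the indices of the blocks kept after dropping the n_remove least sensitive."""
    scores = layer_sensitivity(layers, data)
    drop = set(sorted(range(len(layers)), key=lambda i: scores[i])[:n_remove])
    return [i for i in range(len(layers)) if i not in drop]

# Toy 4-block model: blocks 1 and 3 contribute almost nothing to the output.
toy_layers: List[Layer] = [
    lambda x: 0.5 * x,
    lambda x: 0.001 * x,
    lambda x: -0.3 * x,
    lambda x: 0.0,
]
# Targets are the unpruned model's own outputs, so the baseline loss is zero.
toy_data = [(x, forward(toy_layers, x)) for x in (1.0, 2.0, 3.0)]

kept = prune_least_sensitive(toy_layers, toy_data, n_remove=2)
print(kept)
```

On a real model the same idea would mean evaluating held-out loss with one transformer block bypassed at a time, then deleting the blocks whose removal hurts least. This sketch scores every block once and drops them in a single shot; re-scoring after each removal is a possible variant.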