Depth-first pruning seems to transfer from GPT-2 to Llama (unexpectedly well)

r/artificial

TL;DR: Removing the right transformer layers (instead of shrinking all layers) gives smaller, faster models with minimal quality loss, and this seems to transfer from GPT-2 to Llama.

I've been experimenting with a simple idea: instead of shrinking model width, just remove entire layers based on sensitivity and then recover with distillation. I originally tested it on GPT-2 (124M) and it worked pretty well, so I decided to try the exact same approach on TinyLlama 1.1B to see if it was just a fluke.
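For anyone who wants the gist in code, here's a rough toy sketch of the "remove layers by sensitivity" step. To be clear about the assumptions: it's pure Python with scalar residual blocks standing in for transformer layers, and "sensitivity" here is assumed to mean how far the output drifts from the full model when one layer is skipped on a small calibration set. The names (`make_residual_layer`, `layer_sensitivities`, `prune_layers`) are made up for illustration, and the distillation recovery step is not shown.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_residual_layer(scale: float) -> Layer:
    """Toy residual block, x + scale * x (a stand-in for a transformer layer)."""
    def layer(x: Vector) -> Vector:
        return [v + scale * v for v in x]
    return layer


def forward(layers: Sequence[Layer], x: Vector) -> Vector:
    """Run the input through the stack of layers in order."""
    for layer in layers:
        x = layer(x)
    return x


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def layer_sensitivities(layers: Sequence[Layer], inputs: Sequence[Vector]) -> List[float]:
    """Assumed sensitivity: average output drift when a single layer is skipped."""
    reference = [forward(layers, x) for x in inputs]
    scores = []
    for i in range(len(layers)):
        ablated = [layer for j, layer in enumerate(layers) if j != i]
        drift = sum(
            mse(forward(ablated, x), ref) for x, ref in zip(inputs, reference)
        ) / len(inputs)
        scores.append(drift)
    return scores


def prune_layers(
    layers: Sequence[Layer], inputs: Sequence[Vector], n_remove: int
) -> Tuple[List[int], List[Layer]]:
    """Drop the n_remove least sensitive layers, keeping the rest in order."""
    scores = layer_sensitivities(layers, inputs)
    drop = set(sorted(range(len(layers)), key=lambda i: scores[i])[:n_remove])
    kept = [i for i in range(len(layers)) if i not in drop]
    return kept, [layers[i] for i in kept]


# Five toy layers; the ones with tiny scales barely change the output.
scales = [0.5, 0.01, 0.3, 0.02, 0.4]
layers = [make_residual_layer(s) for s in scales]
calibration = [[1.0, -2.0, 0.5], [0.3, 0.7, -1.2]]

kept_indices, pruned = prune_layers(layers, calibration, n_remove=2)
print(kept_indices)  # [0, 2, 4]
```

In a real model you'd presumably score layers by something like validation loss instead of raw output drift, and then fine-tune the pruned stack against the original model's outputs to recover quality.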