POP: Prefill-Only Pruning for Efficient Large Model Inference

arXiv:2602.03295v2 Announce Type: replace-cross

Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated remarkable capabilities. However, their deployment is hindered by high computational costs. Existing structured pruning methods, while hardware-efficient, often suffer from significant accuracy degradation. In this paper, we argue that this failure stems from a stage-agnostic pruning approach that overlooks the asymmetric roles of the prefill and decode stages. By