Understanding Pruning Regimes in Vision-Language Models Through Domain-Aware Layer Selection

arXiv:2603.20275v1 Announce Type: cross

Transformer-based vision-language models (VLMs) contain substantial depth redundancy, yet the effect of removing specific decoder layers remains poorly understood, especially for domains that require tight coupling between perception and multi-step reasoning. We study structured decoder layer pruning through the lens of domain-aware activation similarity, measuring how strongly each layer transforms representations for math versus non-math inputs.
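
As a rough illustration of what a domain-aware activation similarity score could look like, the sketch below scores each decoder layer by one minus the cosine similarity between its input and output hidden states, averages that score per domain, and reports the per-layer gap between math and non-math inputs. The abstract does not specify the metric, so the choice of cosine similarity, the function names (`layer_change_scores`, `mean_scores`, `domain_gap`), and the toy two-dimensional hidden states are all assumptions made for illustration, not the paper's method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def layer_change_scores(hidden_states):
    """Score each layer by how much it changes the representation.

    hidden_states holds L + 1 vectors: the input to the first layer followed
    by the output of each of the L layers. The score for layer i is
    1 - cos(h_i, h_{i+1}), an assumed stand-in for "how strongly the layer
    transforms representations".
    """
    return [
        1.0 - cosine(hidden_states[i], hidden_states[i + 1])
        for i in range(len(hidden_states) - 1)
    ]


def mean_scores(batch):
    """Average the per-layer scores over a batch of examples from one domain."""
    per_example = [layer_change_scores(states) for states in batch]
    n_layers = len(per_example[0])
    return [
        sum(scores[i] for scores in per_example) / len(per_example)
        for i in range(n_layers)
    ]


def domain_gap(math_batch, other_batch):
    """Per-layer difference: mean score on math inputs minus non-math inputs."""
    math_mean = mean_scores(math_batch)
    other_mean = mean_scores(other_batch)
    return [m - o for m, o in zip(math_mean, other_mean)]


# Toy data (hypothetical): one example per domain, two layers, 2-D hidden states.
math_batch = [[[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]]
other_batch = [[[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]]
gap = domain_gap(math_batch, other_batch)
```

Under this assumed scoring, a layer with a large positive gap changes math representations more than non-math ones, and a layer whose scores are near zero for both domains changes little in either. Whether such scores predict which layers can be safely pruned is an empirical question that the sketch does not address.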