ShishuLM: Achieving Optimal and Efficient Parameterization with Low Attention Transformer Models

arXiv:2510.13860v2 Announce Type: replace-cross While the transformer architecture has achieved state-of-the-art performance on natural language processing tasks, these models impose substantial memory and computational overhead. Recent research has identified significant architectural redundancies within these models, particularly in the attention sub-layers in the top layers, presenting opportunities for optimization without compromising performance. Taking insights from research on inference-time layer pruning and depth-dependent computation in language models, we.