AI RESEARCH

Attention to Mamba: A Recipe for Cross-Architecture Distillation

arXiv cs.LG

arXiv:2604.14191v1 Announce Type: cross

State Space Models (SSMs) such as Mamba have become a popular alternative to Transformer models because they consume less memory and achieve higher generation throughput than their attention-based counterparts. On the other hand, the community has built up a considerable body of knowledge on how to train Transformers, and many pretrained Transformer models are readily available.