AI RESEARCH
Attention to Mamba: A Recipe for Cross-Architecture Distillation
arXiv CS.LG
arXiv:2604.14191v1 Announce Type: cross State Space Models (SSMs) such as Mamba have become a popular alternative to Transformer models because they consume less memory and achieve higher throughput during generation than their Attention-based counterparts. At the same time, the community has built up a considerable body of knowledge on how to train Transformers, and many pretrained Transformer models are readily available.