AI RESEARCH
Diagonal-Tiled Mixed-Precision Attention for Efficient Low-Bit MXFP Inference
arXiv CS.AI
arXiv:2604.03950v1 Announce Type: cross. Transformer-based large language models (LLMs) have demonstrated remarkable performance across a wide range of real-world tasks, but their inference cost remains prohibitively high due to the quadratic complexity of attention and the memory-bandwidth limitations of high-precision operations. In this work, we present a low-bit mixed-precision attention kernel that uses the microscaling floating-point (MXFP) data format to exploit the compute capability of next-generation GPU architectures.
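For readers unfamiliar with MXFP, the following is a minimal sketch of block-wise microscaling quantization. It is not the kernel described in the abstract. It assumes 4-bit E2M1 elements, blocks of 32 values, and a shared power-of-two scale, following the OCP Microscaling Formats specification as commonly described. The names `quantize_block` and `dequantize_block` are hypothetical, and the choice of a 4-bit element type is an assumption, since the abstract says only "low-bit".

```python
import math

# Assumption: FP4 E2M1 elements, whose representable magnitudes are listed
# here (the sign is handled separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2    # exponent of the largest E2M1 binade (4.0 == 2**2)
BLOCK_SIZE = 32  # number of elements sharing one scale (assumed MX block size)


def quantize_block(values):
    """Quantize one block; return (shared_exponent, quantized_elements)."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0.0] * len(values)
    # floor(log2(amax)) computed exactly via frexp: amax == m * 2**e, 0.5 <= m < 1.
    _, e = math.frexp(amax)
    # Power-of-two shared scale that places the block maximum in the top binade.
    shared_exp = (e - 1) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    quantized = []
    for v in values:
        mag = abs(v) / scale
        # Round to the nearest grid point. Magnitudes above 6.0 saturate, and
        # ties go to the smaller magnitude in this sketch.
        nearest = min(E2M1_GRID, key=lambda g: abs(g - mag))
        quantized.append(math.copysign(nearest, v))
    return shared_exp, quantized


def dequantize_block(shared_exp, quantized):
    """Recover approximate real values from a quantized block."""
    scale = 2.0 ** shared_exp
    return [q * scale for q in quantized]


if __name__ == "__main__":
    block = [8.0, -24.0, 48.0, 3.2] + [0.0] * (BLOCK_SIZE - 4)
    exp, q = quantize_block(block)
    print(exp, dequantize_block(exp, q)[:4])
```

Because every element in a block shares one scale, a single large value coarsens the grid for its 31 neighbours. That sensitivity to outliers is one common motivation for mixing precisions across tiles rather than using a single low-bit format throughout.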