MXNorm: Reusing MXFP block scales for efficient tensor normalisation

arXiv:2603.13180v1 Announce Type: cross Matrix multiplication performance has long been the major bottleneck to scaling deep learning workloads, which has stimulated the design of new accelerators that use increasingly low-precision number formats. However, improvements in matrix multiplication performance have far outstripped improvements in performance on reductions and elementwise computations, which are still being performed in higher precision.