[D] MXFP8 GEMM: Up to 99% of cuBLAS performance using CUDA + PTX

r/MachineLearning

New blog post by Daniel Vega-Myhre (Meta/PyTorch) illustrating GEMM design for FP8, including deep dives into the constraints and design challenges.