[D] 60% MatMul Performance Bug in cuBLAS on RTX 5090

cuBLAS dispatches an inefficient kernel for every batched FP32 workload I tested, from 256×256 to 8192×8192×8, and uses only ~40% of the available compute on RTX GPUs. I tested on an RTX 5090, but all non-Pro RTX GPUs are likely affected. This is with the latest software: CUDA 13.2.51, cuBLAS 13.3.0, and driver 595.58.03. Previous versions are even worse. To measure the gap, I wrote a simple yet efficient kernel and compared it to cuBLAS across a variety of workloads.
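
For anyone who wants to sanity-check a utilization figure like the ~40% above, here is a rough Python sketch of the arithmetic (not the actual benchmark code). It assumes the usual 2·M·N·K FLOP count per matrix product and reads "8192×8192×8" as square 8192-sized matrices with a batch of 8. The function names, `elapsed_s`, and `peak_flops` are placeholders of mine; you would plug in your own measured kernel time and the FP32 peak you trust for your card.

```python
def batched_gemm_flops(m: int, n: int, k: int, batch: int = 1) -> int:
    """FLOPs for `batch` independent (m x k) @ (k x n) products.

    Each output element needs k multiplies and k adds, hence the factor 2.
    """
    return 2 * m * n * k * batch


def utilization(m: int, n: int, k: int, batch: int,
                elapsed_s: float, peak_flops: float) -> float:
    """Fraction of peak throughput achieved by one timed batched GEMM.

    elapsed_s:  measured kernel time in seconds (assumed to come from your own timing)
    peak_flops: theoretical FP32 peak in FLOP/s (an assumption you supply)
    """
    achieved = batched_gemm_flops(m, n, k, batch) / elapsed_s
    return achieved / peak_flops


# FLOP count for the largest workload, assuming M = N = K = 8192 and batch = 8.
print(batched_gemm_flops(8192, 8192, 8192, 8))  # 8796093022208
```

Dividing the same FLOP count by the cuBLAS time and by the custom kernel time gives two utilization numbers that can be compared directly, as long as both runs use the same `peak_flops`.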