FeatherOps: Fast fp8 matmul on RDNA3 without native fp8

r/LocalLLaMA

I'm working on it in ComfyUI, and the kernel can also be used in LLM