ggml-cuda: add flash-attn support for DKQ=320/DV=256 with ncols2=32 (… by lnigam · Pull Request #22286 · ggml-org/llama.cpp

r/LocalLLaMA

Improves the speed of Mistral Small 4 on CUDA (previously there was a CPU fallback). I wonder if it's somehow related to the upcoming Mistral model? Maybe not. Submitted by /u/jacek2023