ggml : add NVFP4 quantization type support
r/LocalLLaMA
It's available from b8297 onwards, so grab the latest llama.cpp version. This adds support for NVIDIA's NVFP4 quantization format (FP4 E2M1 weights, UE4M3 per-block scale, 16 elements per block). This is the format produced by NVIDIA ModelOpt's NVFP4 algo. The main difference is the scale encoding (UE4M3 vs E8M0
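
For anyone wondering what those numbers mean in practice, here is a rough Python sketch of dequantizing a single block as described: 16 FP4 E2M1 codes multiplied by one UE4M3 scale. This is not the ggml code. The function names are made up for illustration, the UE4M3 decoding assumes the usual E4M3 conventions (exponent bias 7, subnormals when the exponent field is 0) with no special handling of NaN encodings, and the way llama.cpp actually packs nibbles and scales in memory is not modeled here.

```python
# Illustrative sketch only; names and layout are assumptions, not ggml's API.

# Magnitudes representable by E2M1 (2 exponent bits, 1 mantissa bit),
# indexed by the low 3 bits of the 4-bit code.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_e2m1(code):
    """Decode a 4-bit FP4 E2M1 code (bit 3 is the sign) to a float."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0x7]


def decode_ue4m3(byte):
    """Decode an unsigned E4M3 scale: 4 exponent bits, 3 mantissa bits.

    Assumes exponent bias 7 and subnormals for exponent field 0.
    """
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0:
        # Subnormal: no implicit leading 1, fixed exponent of 2^-6.
        return (man / 8.0) * 2.0 ** -6
    return (1.0 + man / 8.0) * 2.0 ** (exp - 7)


def dequantize_block(scale_byte, codes):
    """Dequantize one block of 16 E2M1 codes sharing one UE4M3 scale."""
    if len(codes) != 16:
        raise ValueError("expected 16 elements per block")
    scale = decode_ue4m3(scale_byte)
    return [scale * decode_e2m1(c) for c in codes]
```

Under these assumptions the scale carries mantissa bits, so it is not restricted to powers of two the way an exponent-only E8M0 scale is.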