Inference numbers for Mistral-Small-4-119B-2603 NVFP4 on an RTX Pro 6000

r/LocalLLaMA

Benchmarked Mistral-Small-4-119B-2603 NVFP4 on an RTX Pro 6000 card using SGLang: context from 1K to 256K, 1 to 5 concurrent requests, 1024 output tokens per request. No prompt caching, no speculative decoding (I couldn't get it working for the NVFP4 model), full-precision KV cache. Methodology below.
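
To give a rough idea of the size of that sweep, here is a minimal sketch that enumerates the run grid. The post only states the 1K and 256K endpoints, so the doubling steps in between are an assumption, and `benchmark_grid` and its field names are hypothetical, not part of SGLang or the actual benchmark script.

```python
from itertools import product

# Assumption: only the 1K and 256K endpoints are stated in the post;
# the doubling steps in between are a guess for illustration.
CONTEXT_LENGTHS = [1024 * 2**i for i in range(9)]  # 1K, 2K, ..., 256K
CONCURRENCY_LEVELS = range(1, 6)  # 1 to 5 concurrent requests
OUTPUT_TOKENS = 1024  # output tokens per request, as stated in the post


def benchmark_grid():
    """Return one dict per (context length, concurrency) combination."""
    return [
        {
            "context_tokens": ctx,
            "concurrency": conc,
            "output_tokens_per_request": OUTPUT_TOKENS,
            # Prompt plus generated tokens across all concurrent requests
            # once every request has finished generating.
            "peak_tokens_in_flight": conc * (ctx + OUTPUT_TOKENS),
        }
        for ctx, conc in product(CONTEXT_LENGTHS, CONCURRENCY_LEVELS)
    ]


if __name__ == "__main__":
    for run in benchmark_grid():
        print(run)
```

The `peak_tokens_in_flight` field matters here because there is no prompt caching and the KV cache is full precision, so every concurrent request carries its whole context in the cache.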