Quantization and Fast Inference (MEAP) - How much performance are you actually getting from quantization in production? [D]

Hi all, Stjepan from Manning here. The mods said it's fine if I post this here. I wanted to share a new MEAP (early access) release we think will land well with people here: Quantization and Fast Inference by Kalyan Aranganathan: Quantization and Fast Inference A lot of ML deployment discussions still revolve around model quality first and infrastructure second. Then the bill shows up. Or latency becomes unacceptable. Or the model that worked fine on A100s suddenly needs to run somewhere much smaller.