NanoQuant: Efficient Sub-1-Bit Quantization of Large Language Models

arXiv:2602.06694v2

Weight-only quantization has become a standard approach for efficiently serving large language models (LLMs). However, existing methods fail to efficiently compress models to binary (1-bit) levels, as they either require large amounts of data and compute or incur additional storage. In this work, we propose NanoQuant, the first post-