MUXQ: Mixed-to-Uniform Precision MatriX Quantization via Low-Rank Outlier Decomposition

arXiv:2604.04701v1 Announce Type: cross Large language models (LLMs) have achieved outstanding performance across a wide range of natural language processing tasks, but their enormous parameter counts impose substantial memory and computational overheads. This challenge is particularly critical in NPU-based on-device environments, where FP16/FP32 computation is inefficient and integer (INT) quantization is, therefore, essential. However, existing methods, including ZeroQuant, LLM.int8, and SmoothQuant, do not fully address input-activation outliers and the associated hardware inefficiencies.
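
To make the activation-outlier problem concrete, the following is a minimal sketch of symmetric per-tensor INT8 quantization applied to a small activation vector, with and without a single large value. It is not the MUXQ method or any of the cited methods; the function names and the activation values are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: symmetric per-tensor INT8 quantization of a
# hypothetical activation vector, with and without a single outlier.

def quantize_int8(values):
    """Symmetric per-tensor quantization to the range [-127, 127]."""
    scale = max(abs(v) for v in values) / 127.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def mean_abs_error(values):
    """Mean absolute round-trip error after quantize -> dequantize."""
    codes, scale = quantize_int8(values)
    return sum(abs(v - c * scale) for v, c in zip(values, codes)) / len(values)

# Hypothetical activations: small values, then the same values plus one outlier.
typical = [0.1 * i for i in range(-10, 11)]   # values in [-1.0, 1.0]
with_outlier = typical + [60.0]               # one large-magnitude entry

err_typical = mean_abs_error(typical)

# Error measured on the small values only, using the outlier-inflated scale.
codes, scale = quantize_int8(with_outlier)
err_small_with_outlier = sum(
    abs(v - c * scale) for v, c in zip(typical, codes)
) / len(typical)

print(f"scale without outlier: {quantize_int8(typical)[1]:.5f}")
print(f"scale with outlier:    {scale:.5f}")
print(f"error on small values: {err_typical:.5f} -> {err_small_with_outlier:.5f}")
```

Because the scale is set by the largest magnitude, one outlier stretches the quantization grid and the small values collapse onto a handful of integer codes, which is the effect that outlier-aware schemes try to avoid.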