Compressed-Sensing-Guided, Inference-Aware Structured Reduction for Large Language Models

arXiv:2604.14156v1 Announce Type: new Large language models deliver strong generative performance but at the cost of massive parameter counts, memory use, and decoding latency. Prior work has shown that pruning and structured sparsity can preserve accuracy under substantial compression, while prompt-compression methods reduce latency by removing redundant input tokens. However, these two directions remain largely separate.