AI RESEARCH
RAP: Runtime Adaptive Pruning for LLM Inference
arXiv cs.LG
arXiv:2505.17138v5 Announce Type: replace
Large language models (LLMs) excel at language understanding and generation, but their enormous computational and memory requirements hinder deployment. Compression can mitigate these constraints. However, most existing methods rely on fixed heuristics and therefore fail to adapt to runtime memory variations or to the heterogeneous KV-cache demands that arise from diverse user requests.