AI RESEARCH

MCAP: Deployment-Time Layer Profiling for Memory-Constrained LLM Inference

arXiv CS.LG

arXiv:2604.21026v1 Announce Type: new Deploying large language models to heterogeneous hardware is often constrained by memory, not compute. We