AI RESEARCH
MCAP: Deployment-Time Layer Profiling for Memory-Constrained LLM Inference
arXiv cs.LG
arXiv:2604.21026v1 Announce Type: new Deploying large language models to heterogeneous hardware is often constrained by memory, not compute. We