AI RESEARCH

Foundry: Template-Based CUDA Graph Context Materialization for Fast LLM Serving Cold Start

arXiv CS.LG

arXiv:2604.06664v1 Announce Type: cross Modern LLM service providers increasingly rely on autoscaling and parallelism reconfiguration to respond to rapidly changing workloads, but cold-start latency remains a major bottleneck. While recent systems have reduced model weight loading to seconds, CUDA graph capture still takes tens of seconds to minutes and often dominates startup.