Training-Inference Consistent Segmented Execution for Long-Context LLMs

arXiv:2605.11744v1 Announce Type: cross Transformer-based large language models face severe scalability challenges in long-context generation due to the computational and memory costs of full-context attention. To work within practical computation and memory constraints, many efficient long-context methods adopt bounded-context or segment-level execution only during inference while continuing to train models under full-context attention, resulting in a mismatch between training and inference.