Rate-Distortion Optimization for Transformer Inference

arXiv:2601.22002v2 Announce Type: replace Transformers achieve superior performance on many tasks, but impose heavy compute and memory requirements during inference. Inference can be made more efficient by partitioning it across multiple devices, which, in turn, requires compressing the intermediate representations. We