AI RESEARCH
Rate-Distortion Optimization for Transformer Inference
arXiv CS.LG
arXiv:2601.22002v2 Announce Type: replace Transformers achieve superior performance on many tasks but impose heavy compute and memory requirements during inference. Inference can be made more efficient by partitioning it across multiple devices, which in turn requires compressing its intermediate representations. We