AI RESEARCH

PreFT: Prefill-only finetuning for efficient inference

arXiv cs.LG

arXiv:2605.14217v1 Announce Type: new

Large language models can now be personalised efficiently at scale using parameter-efficient finetuning methods (PEFTs), but serving user-specific PEFTs harms throughput, even with specialised kernels and memory management techniques. The cause, both theoretically and empirically, is a mismatch between prefill (processing a large number of tokens at once) and decode (generating a single token at a time autoregressively): decode has far lower throughput when serving multiple adapters.
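One way to build intuition for the prefill/decode mismatch is a back-of-envelope count of how much compute each byte of adapter weights supports. The sketch below is an illustration only, not the paper's analysis: it assumes a LoRA-style adapter (two low-rank matrices per layer), ignores activation traffic and kernel overheads, and the function name and the default values for `d_model`, `rank`, and `bytes_per_param` are hypothetical.

```python
def lora_arithmetic_intensity(
    tokens_per_adapter: int,
    d_model: int = 4096,       # hypothetical hidden size
    rank: int = 16,            # hypothetical adapter rank
    bytes_per_param: int = 2,  # assumes 16-bit weights
) -> float:
    """FLOPs performed per byte of adapter weights read, for one LoRA pair.

    Assumes an adapter made of A (d_model x rank) and B (rank x d_model),
    and counts only weight reads, not activations.
    """
    params = d_model * rank + rank * d_model
    # Each token does one multiply-add (2 FLOPs) per adapter parameter.
    flops = 2 * tokens_per_adapter * params
    # The weights are read once per step, however many tokens share them.
    weight_bytes = bytes_per_param * params
    return flops / weight_bytes


# Prefill: every prompt token of a request reuses the same adapter weights.
prefill = lora_arithmetic_intensity(tokens_per_adapter=512)
# Decode: each user contributes one new token per step to their own adapter.
decode = lora_arithmetic_intensity(tokens_per_adapter=1)

print(f"prefill: {prefill:.0f} FLOPs/byte, decode: {decode:.0f} FLOPs/byte")
```

Under these assumptions the ratio equals the number of tokens that share an adapter in a single step, so a 512-token prompt amortises each weight read 512 times while a decode step amortises it once. With many distinct user adapters in a batch, decode therefore spends most of its time moving adapter weights rather than computing.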