AI RESEARCH
PreFT: Prefill-only finetuning for efficient inference
arXiv CS.LG
arXiv:2605.14217v1 Announce Type: new
Large language models can now be personalised efficiently at scale using parameter-efficient finetuning methods (PEFTs), but serving user-specific PEFTs harms throughput, even with specialised kernels and memory management techniques. The cause, both in theory and in practice, is a mismatch between prefill (processing a large number of tokens at once) and decode (generating a single token autoregressively): decode has far lower throughput than prefill when serving multiple adapters.
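The prefill/decode mismatch can be illustrated with a back-of-envelope calculation. The sketch below is not taken from the paper: it assumes a LoRA-style adapter (two low-rank matrices per adapted linear layer), fp16 weights, and a simple cost model that counts only the bytes of adapter weights read, ignoring activations. The function name and parameters are hypothetical, chosen for illustration only.

```python
def lora_arithmetic_intensity(tokens_per_adapter: int,
                              d_model: int,
                              rank: int,
                              bytes_per_param: int = 2) -> float:
    """FLOPs per byte of adapter weights read, for one LoRA-adapted layer.

    Assumed cost model (illustrative, not from the paper):
    - The adapter computes (x @ A) @ B with A: d_model x rank, B: rank x d_model.
    - Each multiply-add counts as 2 FLOPs.
    - Adapter weights are read from memory once per batch of tokens
      that share the same adapter.
    """
    # Two matmuls, each with tokens * d_model * rank multiply-adds.
    flops = 2 * (2 * tokens_per_adapter * d_model * rank)
    # Both A and B must be loaded for this adapter.
    weight_bytes = 2 * d_model * rank * bytes_per_param
    return flops / weight_bytes


# Prefill: one user's 512-token prompt reuses that user's adapter weights.
prefill = lora_arithmetic_intensity(tokens_per_adapter=512, d_model=4096, rank=16)

# Decode: each user contributes a single new token per step, so every
# adapter's weights are loaded to process just one token.
decode = lora_arithmetic_intensity(tokens_per_adapter=1, d_model=4096, rank=16)

print(prefill)  # 512.0
print(decode)   # 1.0
```

Under these assumptions the work done per byte of adapter weights scales with the number of tokens that share an adapter, so decode with many distinct adapters is far more memory-bound than prefill. Real serving systems involve additional factors (base-model weights, KV cache traffic, kernel batching) that this sketch leaves out.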