AI RESEARCH

Budgeted LoRA: Distillation as Structured Compute Allocation for Efficient Inference

arXiv CS.AI

arXiv:2605.04341v1 Announce Type: cross We study distillation for large language models under explicit compute constraints, with the goal of producing student models that are not only cheaper to train but also structurally efficient at inference time. While prior approaches to parameter-efficient distillation, such as LoRA, reduce adaptation cost, they leave the dense backbone unchanged and therefore fail to deliver meaningful inference savings. We propose Budgeted LoRA, a distillation framework that treats model compression as a structured compute allocation problem.
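
The point about standard LoRA can be illustrated with a small sketch. It is not the Budgeted LoRA method, only a reminder of why a plain low-rank adapter saves training cost but not inference cost. The layer sizes, the rank, and all function names below are illustrative assumptions, not values from the paper.

```python
# Illustrative sketch (assumed sizes and names): standard LoRA trains a
# low-rank update B @ A on top of a frozen dense weight W. After merging,
# the layer is still one dense matrix of the same shape as W.

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def merge_lora(W, B, A):
    """Return W + B @ A, the merged weight used at inference time."""
    BA = matmul(B, A)
    return [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]

def dense_params(d_in, d_out):
    """Parameter count of the dense backbone weight."""
    return d_in * d_out

def lora_trainable_params(d_in, d_out, rank):
    """Trainable parameters of the adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)

# Tiny concrete merge: d_out = 3, d_in = 4, rank = 2.
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]
B = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
A = [[0.5, 0.0, 0.0, 0.0],
     [0.0, 0.5, 0.0, 0.0]]
W_merged = merge_lora(W, B, A)

# The merged weight has the same shape as W, so the per-token matmul cost
# at inference is the same as for the original dense layer.
same_shape = (len(W_merged), len(W_merged[0])) == (len(W), len(W[0]))

# At an assumed realistic size, the adapter is a small fraction of the
# dense weight, which is where the training-side savings come from.
trainable = lora_trainable_params(4096, 4096, 8)
dense = dense_params(4096, 4096)
fraction = trainable / dense
```

Under these assumed sizes the adapter holds 1/256 of the dense layer's parameters, yet the merged layer performs the same dense multiplication as before. That gap between cheap adaptation and unchanged inference cost is what the abstract identifies as the limitation of prior parameter-efficient approaches.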