Goal-Conditioned Supervised Learning for LLM Fine-Tuning

arXiv:2605.16345v1 Announce Type: new Large language models often require fine-tuning to better align their behavior with user intent at deployment. Existing approaches are commonly divided into online and offline paradigms. Online methods, such as RL-based alignment, can directly optimize outcome quality but typically rely on external reward models and iterative rollouts, making them costly and difficult to deploy in many cases.