An Empirical Study of SFT-DPO Interaction and Parameterization in Small Language Models

arXiv:2603.20100v1 Announce Type: cross Direct Preference Optimization (DPO) is widely used after supervised fine-tuning (SFT) to align language models, yet its empirical behavior with small backbones and modest data is underspecified. We systematically compare SFT-only, DPO-only, and staged SFT-to-DPO