Post-SFT Alignment with DPO and GRPO: How to Fine-Tune Correctly, Part 6

SFT taught your model what to say. This episode teaches it what to prefer.

Most engineers who fine-tune language models hit a wall they do not see coming. The SFT loss converges. The format compliance looks right. The outputs are coherent. And then the model goes into production and starts producing the same three sentence structures in rotation, collapses the moment a prompt has two reasonable interpretations, and sounds like it is reading from a script rather than reasoning through a problem. This is not a hyperparameter issue. It is not a data quality issue.