Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs
Dev.to AI