MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge

arXiv:2507.21183v5 Announce Type: replace-cross As the era of large language models (LLMs) unfolds, Preference Optimization (PO) methods have become a central approach to aligning LLMs with human preferences and improving performance. We propose Maximum a Posteriori Preference Optimization (MaPPO), a methodology for learning from preferences that explicitly incorporates prior reward knowledge into the optimization objective.