TLPO: Token-Level Policy Optimization for Mitigating Language Confusion in Large Language Models

arXiv:2604.26553v1 Announce Type: cross Large language models (LLMs) demonstrate strong multilingual capabilities, yet often fail to consistently generate responses in the intended language, exhibiting a phenomenon known as language confusion. Prior mitigation approaches based on sequence-level fine-tuning, such as DPO, ORPO, and GRPO, operate at the level of entire responses and can lead to unintended degradation of general model capabilities, motivating the need for fine-grained alternatives. To address this, we.