ReLaX: Reasoning with Latent Exploration for Large Reasoning Models

arXiv:2512.07558v2 Announce Type: replace Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated remarkable potential in enhancing the reasoning capability of Large Reasoning Models (LRMs). However, RLVR often drives the policy toward over-determinism, resulting in ineffective exploration and premature policy convergence.