From the Inside Out: Progressive Distribution Refinement for Confidence Calibration

arXiv:2603.16500v1 Announce Type: new Leveraging a model's internal information as a self-reward signal in Reinforcement Learning (RL) has received extensive attention because it requires no labels. While prior work has made significant progress in applying Test-Time Scaling (TTS) strategies to RL, the discrepancy in internal information between test and