Goldilocks RL: Tuning Task Difficulty to Escape Sparse Rewards for Reasoning

arXiv:2602.14868v2 Announce Type: replace-cross Reinforcement learning has emerged as a powerful paradigm for unlocking reasoning capabilities in language models. However, relying on sparse rewards makes this process highly sample-inefficient, as models must navigate vast search spaces with minimal feedback. While classic curriculum learning aims to mitigate this by ordering data based on complexity, prior work has primarily targeted small datasets and does not directly transfer to the large-scale settings typical of modern LM