Owen-Shapley Policy Optimization: A Principled RL Algorithm for Generative Search LLMs

ArXi:2601.08403v2 Announce Type: replace Large language models are increasingly trained via reinforcement learning for personalized recommendation tasks, but standard methods like GRPO rely on sparse, sequence-level rewards. These obscure which tokens actually contribute to high-quality outputs, creating a credit assignment gap. This gap is especially problematic when models must infer latent user intent from under-specified language without ground truth labels, which is a reasoning pattern rarely seen during pre.