ShapE-GRPO: Shapley-Enhanced Reward Allocation for Multi-Candidate LLM Training

arXiv:2603.29871v1

In user-agent interaction scenarios such as recommendation, brainstorming, and code suggestion, Large Language Models (LLMs) often generate a set of candidate recommendations, where the objective is to maximize the collective utility of the entire set rather than the utility of each candidate independently. However, existing reinforcement learning post-