Negative Advantage Is a Double-Edged Sword: Calibrating Advantage in GRPO for Deep Search

arXiv:2604.18235v1

Deep search agents can autonomously initiate multi-turn interactions with search engines, thereby exhibiting strong question-answering capabilities. Such performance critically relies on Group Relative Policy Optimization (GRPO) as its core