DPEPO: Diverse Parallel Exploration Policy Optimization for LLM-based Agents

arXiv:2604.24320v1 Announce Type: new

Large language model (LLM) agents that follow the sequential "reason-then-act" paradigm have achieved superior performance on many complex tasks. However, these methods suffer from limited exploration and incomplete environmental understanding, as they interact with only a single environment per step. In this paper, we first