SEARL: Joint Optimization of Policy and Tool Graph Memory for Self-Evolving Agents

arXiv:2604.07791v2 Announce Type: replace Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have demonstrated significant potential in single-turn reasoning tasks. With the paradigm shift toward self-evolving agentic learning, models are increasingly expected to learn from trajectories by synthesizing tools or accumulating explicit experiences. However, prevailing methods typically rely on large-scale LLMs or multi-agent frameworks, which hinder their deployment in resource-constrained environments.