LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues

arXiv:2605.12493v1 Announce Type: new Long-term memory is crucial for agents in specialized web environments, where success depends on recalling interface affordances, state dynamics, workflows, and recurring failure modes. However, existing memory benchmarks for agents mostly focus on user histories, short traces, or downstream task success, leaving open how to directly evaluate whether memory systems effectively internalize environment-specific experience. To address this gap, we