Is One Score Enough? Rethinking the Evaluation of Sequentially Evolving LLM Memory

arXiv:2605.15384v1 Announce Type: cross Memory plays a central role in enabling large language models (LLMs) to operate over sequential tasks by accumulating and reusing experience over time. However, existing evaluations of LLM memory mostly rely on aggregate metrics such as final hold-out accuracy or cumulative online performance, which can obscure critical failure modes such as forgetting and negative transfer. In this paper, we