We audited LoCoMo: 6.4% of the answer key is wrong and the judge accepts up to 63% of intentionally wrong answers

r/LocalLLaMA
Generative AI AI Research

Projects are still submitting new scores on LoCoMo as of March 2026. but the benchmark is deeply flawed. We audited it and found 6.4% of the answer key is wrong, and the LLM judge accepts up to 63% intentionally wrong answers. LongMemEval-S fits entirely in modern context windows, making it of a context window test than a memory test. Here's what we found. LoCoMo LoCoMo ( Maharana, ACL 2024 ) is one of the most widely cited memory benchmarks. We did a systematic audit of the ground truth and found 99 score-corrupting errors in 1,540 questions (6.4.