[D] The problem with comparing AI memory system benchmarks — different evaluation methods make scores meaningless

I've been reviewing how various AI memory systems evaluate their performance and noticed a fundamental issue with cross-system comparison. Most systems benchmark on LOCOMO (Maharana, ACL 2024), but the evaluation methods vary significantly. LOCOMO's official metric (Token-Overlap F1) gives GPT-4 full context 32.1% and human performance 87.9%. However, memory system developers report scores of 60-67% using custom evaluation criteria such as retrieval accuracy or keyword matching rather than the original F1 metric.