WMB-100K – open-source benchmark for AI memory systems at 100K turns

Been thinking about how AI memory systems are only ever tested at tiny scales: LOCOMO does 600 turns, LongMemEval does around 1,000. But real usage doesn't look like that. WMB-100K tests 100,000 turns, with 3,134 questions across 5 difficulty levels. It also includes false memory probes, because answering "I don't know" is fine, but confidently giving wrong info is a real problem. The dataset is included, and it costs about $0.07 to run. Curious to see how different systems perform. GitHub link is in the comments.

submitted by /u/Efficient_Joke3384