MemAware benchmark shows that RAG-based agent memory fails on implicit context — search scores 2.8% vs 0.8% with no memory

Built a benchmark that tests something none of the existing memory benchmarks test: can an AI agent surface relevant past context when the user doesn't ask about it? Most agent memory systems work like this: user asks something → agent searches memory → retrieves results → answers.