How do you benchmark structural properties of agent memory (isolation, context pollution, typed memory) beyond retrieval metrics? [D]

I'm working on an open-source memory infrastructure for AI agents (CtxVault). It organizes agent memory into typed, isolated vaults rather than a single shared vector store. I've run standard retrieval benchmarks (BEIR, CoIR) comparing it against raw ChromaDB and LangChain, and confirmed that the vault abstraction adds no retrieval overhead. That part is straightforward.

The part I'm stuck on is how to benchmark the properties that actually differentiate the system. There are two main claims I want to evaluate. First, context isolation.