Hit 90.4% on LongMemEval-S with structured storage - no embeddings, ~half the tokens, 98% retrieval accuracy

Solo dev, been working on this on the side during my first year of uni. Out of 500 questions, 10 were missing the context needed to answer (retrieval fails), and the rest of the failures were the model misusing context it already had, so I'm going to keep iterating to hit the top of the leaderboard. I know it's closed source, which makes it not reproducible and hard to trust, so I made a bench viewer where you can see all 500 questions sorted by category and pass/fail, with the ground truth, question, and c137 response for each, and with fails bucketed into model-fails vs retrieval-fails. You can switch between the 3 answerer models. The grading script is the official one from the bench repo, linked there.
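For anyone who wants to sanity-check how the headline numbers fit together, here is a rough sketch of the arithmetic. It assumes 90.4% means exactly 452 of 500 graded correct and that every fail lands in exactly one of the two buckets. The function and field names are made up for illustration and are not from the c137 code or the bench repo.

```python
def breakdown(total: int, accuracy: float, retrieval_fails: int) -> dict:
    """Split benchmark failures into retrieval-fails and model-fails.

    Assumption: each failed question belongs to exactly one bucket.
    """
    passed = round(total * accuracy)           # questions graded correct
    failed = total - passed                    # all failures
    model_fails = failed - retrieval_fails     # context was retrieved but misused
    # Share of questions where the needed context was retrieved at all.
    retrieval_accuracy = (total - retrieval_fails) / total
    return {
        "passed": passed,
        "failed": failed,
        "retrieval_fails": retrieval_fails,
        "model_fails": model_fails,
        "retrieval_accuracy": retrieval_accuracy,
    }


# Numbers taken from the post: 500 questions, 90.4% overall, 10 retrieval fails.
stats = breakdown(total=500, accuracy=0.904, retrieval_fails=10)
print(stats)
```

Under those assumptions that works out to 452 passes and 48 fails, of which 10 are retrieval-fails and 38 are model-fails, so the 98% retrieval figure is also the ceiling an answerer model could reach without any retrieval changes.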