AI Model Reviews

LLM benchmarks are terrible. Everyone overfits their models, so any new benchmark gets maxed out no more than a few months after its release. Open source models launch with headlines like "90% of Opus at 5% of the cost", yet anyone who has actually used them can feel the obvious difference in quality. And now that benchmarks mean nothing, it has become impossible to find good reviews of models. Every result for the Google search "minimax m2.7 review" is either an AI-written slop blog post made in 10 minutes (these are the worst) or a pile of meaningless benchmark results.