The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next

arXiv:2605.18840v1 Announce Type: cross

Leaderboards rank frontier models on independent axes but do not reveal whether capabilities reinforce or trade off across releases, and at the frontier this interaction is the informative signal. We decompose paired SWE-bench and GPQA Diamond scores into a population-level coupling trend and a per-release residual (the $h$-field). The residual diagnoses each release's capability emphasis and identifies which measurement or stress test is most informative next.
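
One way to picture the trend-plus-residual decomposition is the minimal sketch below. The abstract does not specify the functional form of the coupling trend or the exact definition of the $h$-field, so the ordinary least-squares line is an assumption made for illustration, as are the function name `decompose`, the release labels, and the scores, which are placeholders and not real benchmark results.

```python
from statistics import mean


def decompose(swe, gpqa):
    """Split paired scores into a shared linear trend and per-release residuals.

    Assumption: the population coupling trend is modeled as an ordinary
    least-squares line gpqa ~ intercept + slope * swe. The source abstract
    does not state the form it actually uses.
    """
    mx, my = mean(swe), mean(gpqa)
    sxx = sum((x - mx) ** 2 for x in swe)
    if sxx == 0:
        raise ValueError("SWE-bench scores must not all be identical")
    slope = sum((x - mx) * (y - my) for x, y in zip(swe, gpqa)) / sxx
    intercept = my - slope * mx
    # Residual per release: how far its GPQA score sits from the trend line.
    residuals = [y - (intercept + slope * x) for x, y in zip(swe, gpqa)]
    return intercept, slope, residuals


# Placeholder (SWE-bench, GPQA Diamond) pairs for hypothetical releases.
scores = {
    "release_a": (40.0, 50.0),
    "release_b": (50.0, 62.0),
    "release_c": (60.0, 66.0),
    "release_d": (70.0, 78.0),
}

swe = [pair[0] for pair in scores.values()]
gpqa = [pair[1] for pair in scores.values()]
intercept, slope, h = decompose(swe, gpqa)

for name, residual in zip(scores, h):
    # Positive residual: more GPQA than the trend predicts at this SWE-bench level.
    emphasis = "GPQA-leaning" if residual > 0 else "SWE-bench-leaning"
    print(f"{name}: residual={residual:+.2f} ({emphasis})")
```

In this sketch the sign of a residual stands in for a release's capability emphasis relative to the population, and releases with residuals near zero are the ones the two benchmarks no longer separate.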