What should post-training optimize? A test-time scaling law perspective

arXiv:2605.10716v1 Announce Type: new Large language models are increasingly deployed with test-time strategies: sample $N$ responses, score them with a reward model or verifier, and return the best. This deployment rule exposes a mismatch in post-