Likelihood scoring for continuations of mathematical text: a self-supervised benchmark with tests for shortcut vulnerabilities