Measurement Risk in Supervised Financial NLP: Rubric and Metric Sensitivity on JF-ICR

arXiv:2604.27374v1 Announce Type: new As LLMs become credible readers of earnings calls, investor-relations Q&A, guidance, and disclosure language, supervised financial NLP benchmarks increasingly function as decision evidence for model selection and deployment. A hidden assumption is that gold labels make such evidence objective. This assumption breaks down when the benchmark ruler itself is sensitive to rubric wording, metric choice, or aggregation policy.