Beyond Black-Box Labels: Interpretable Criteria for Diagnosing Subjective NLP Tasks

arXiv:2604.17022v1 Announce Type: new Subjective NLP datasets typically aggregate annotator judgments into a single gold label, making it difficult to diagnose whether disagreement reflects unclear criteria, collapsed distinctions, or legitimate plurality. We propose a \emph{schema-level diagnostic} for auditing expert-designed annotation schemas \emph{prior to} gold-label commitment, using only multi-annotator criterion judgments.