Confident, Calibrated, or Complicit: Safety Alignment and Ideological Bias in LLM Hate Speech Detection

arXiv:2509.00673v2 Announce Type: replace We investigate the efficacy of Large Language Models (LLMs) in detecting implicit and explicit hate speech, examining how models with minimal safety alignment (uncensored) compare with heavily aligned (censored) counterparts when deployed with political personas. While uncensored models are often framed as offering a less constrained perspective, our results reveal a trade-off: censored models outperform their uncensored counterparts in both accuracy and robustness, achieving 69.0% versus 64.1% strict accuracy.