Beyond Accuracy: An Explainability-Driven Analysis of Harmful Content Detection

arXiv:2603.18015v1 Announce Type: cross Although automated harmful content detection systems are widely used to monitor online platforms, moderators and end users often cannot understand the logic behind their predictions. While recent studies have focused on improving classification accuracy, little attention has been paid to understanding why neural models flag content as harmful, especially in borderline, contextual, and politically sensitive cases.