Hate Speech Detection Still Cooks (Even in 2026)

The failure case you didn’t see coming In late 2025, a major social platform quietly rolled back parts of its LLM-based moderation pipeline after internal audits revealed a systematic pattern: posts in African American Vernacular English (AAVE) were flagged at nearly three times the rate of semantically equivalent Standard American English content. The LLM reasoner, a fine-tuned GPT-4-class model had learned to treat certain phonetic spellings and grammatical constructions as proxies for “informal aggression.” A linguist reviewing the flagged corpus found no aggression whatsoever.