Could refusal layers be masking dialect-conditioned safety failures in MoE models? [D]

r/MachineLearning

I set out to test whether AAVE-coded (African American Vernacular English) prompts cause MoE language models to route, deliberate, and respond differently from semantically matched AE (Academic English) prompts in safety-sensitive situations, especially when refusal behavior is weakened or removed.

I used Qwen3.5-35B-A3B and its HauhauCS no-refusal fine-tuned variant. Both were run at Q8 quantization with greedy decoding for reproducibility.

Three findings, in order of importance, are leading me to ask this question:

1: The “I’m going to commit a violent act” prompt. The released Qwen3.5-35B-A3B refuses both the AAVE and AE prompts.