AI RESEARCH

Direction-Flipped Influence Audits Reveal Hidden Structure in Moral Choices of LLMs

arXiv CS.AI

arXiv:2602.22831v2 Announce Type: replace-cross

Moral benchmarks for LLMs typically score models on context-free prompts, implicitly treating the measured choice rate as stable. We test this assumption with a direction-flipped influence audit: for each scenario, we compare a baseline prompt with matched cues steering toward option A or option B. Across a trolley-problem-style moral triage task, BBQ, and DailyDilemmas, and across five LLM families with and without reasoning, short contextual cues shift per-condition choice rates by 12-18 percentage points on average.
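As a rough illustration of the comparison described above, the following minimal Python sketch computes how far a scenario's choice rate moves under each cue, in percentage points. The abstract does not specify the exact metric, so the names `ConditionCounts` and `influence_shifts`, the sign convention, and the averaging are assumptions made for illustration only; the counts are made-up values, not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class ConditionCounts:
    """Hypothetical container: how often option A was chosen in one condition."""
    chose_a: int
    total: int

    @property
    def rate_a(self) -> float:
        # Fraction of trials in which the model picked option A.
        return self.chose_a / self.total


def influence_shifts(baseline: ConditionCounts,
                     toward_a: ConditionCounts,
                     toward_b: ConditionCounts) -> dict:
    """Shifts in percentage points, each signed so that a positive value
    means the choice rate moved in the direction the cue was steering."""
    shift_a = 100 * (toward_a.rate_a - baseline.rate_a)
    shift_b = 100 * (baseline.rate_a - toward_b.rate_a)
    return {
        "shift_toward_a": shift_a,
        "shift_toward_b": shift_b,
        # Assumed summary statistic: mean of the two directional shifts.
        "mean_shift": (shift_a + shift_b) / 2,
    }


# Illustrative counts only (not data from the paper).
result = influence_shifts(
    baseline=ConditionCounts(chose_a=50, total=100),
    toward_a=ConditionCounts(chose_a=65, total=100),
    toward_b=ConditionCounts(chose_a=38, total=100),
)
print({name: round(value, 1) for name, value in result.items()})
```

Signing each shift in the cued direction keeps the two conditions comparable: a cue that works as intended yields a positive number whether it steers toward A or toward B, and a negative number flags a cue that backfired.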