Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks

arXiv:2604.17761v1 Announce Type: cross Interpretability tools are increasingly used to analyze failures of Large Language Models (LLMs), yet prior work largely focuses on short prompts or toy settings, leaving their behavior on commonly used benchmarks underexplored. To address this gap, we study contrastive, LRP-based attribution as a practical tool for analyzing LLM failures in realistic settings.