Playing the network backward: A Game Theoretic Attribution Framework

arXiv:2605.06212v1 Announce Type: new Attribution methods explain which input features drive a model's prediction, making them central to model debugging and mechanistic interpretability. Yet backward attribution methods, including gradients, LRP, and transformer-specific rules, lack a shared framework in which to compare the underlying backward calculations. We