Probe-Based Data Attribution: Discovering and Mitigating Undesirable Behaviors in LLM Post-Training
arXiv cs.LG
arXiv:2602.11079v3 (Announce Type: replace)

We propose probe-based data attribution, a method that traces behavioral changes in post-trained language models to responsible