AI RESEARCH

Faithful Bi-Directional Model Steering via Distribution Matching and Distributed Interchange Interventions

arXiv cs.LG

arXiv:2602.05234v2 Announce Type: replace

Intervention-based model steering offers a lightweight and interpretable alternative to prompting and fine-tuning. However, because current methods adopt strong optimization objectives from fine-tuning, they are susceptible to overfitting, often underperform, and sometimes generate unnatural outputs. We hypothesize that this is because effective steering requires the faithful identification of internal model mechanisms, not the enforcement of external preferences.