AI RESEARCH
Reducing the Safety Tax in LLM Safety Alignment with On-Policy Self-Distillation
arXiv cs.LG
arXiv:2605.15239v1 Announce Type: new Safety alignment often improves robustness to harmful queries at the cost of reasoning ability, a tradeoff known as the safety tax. A common cause is distributional mismatch: supervised fine-tuning trains the target model on safety demonstrations produced by humans, external models, or fixed self-generated traces, rather than on trajectories sampled from its own policy. We identify off-policy