AI RESEARCH

Reducing the Safety Tax in LLM Safety Alignment with On-Policy Self-Distillation

arXiv CS.LG

ArXi:2605.15239v1 Announce Type: new Safety alignment often improves robustness to harmful queries at the cost of reasoning ability, a tradeoff known as the safety tax. A common cause is distributional mismatch: supervised fine-tuning trains the target model on safety nstrations produced by humans, external models, or fixed self-generated traces, rather than on trajectories sampled from its own policy. We identify off-policy