Token Buncher: Shielding LLMs from Harmful Reinforcement Learning Fine-Tuning

arXiv:2508.20697v3 Announce Type: replace As large language models (LLMs) continue to grow in capability, so do the risks of harmful misuse through fine-tuning. While most prior studies assume that attackers rely on supervised fine-tuning (SFT) for such misuse, we systematically demonstrate that reinforcement learning (RL) enables adversaries to effectively break safety alignment and facilitate advanced harmful task assistance, under matched computational budgets.