Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment

arXiv:2605.13537v1 Announce Type: cross Inference-time alignment techniques offer a lightweight alternative or complement to costly reinforcement learning, while enabling continual adaptation as alignment objectives and reward targets evolve. Existing theoretical analyses justify these methods as approximations to sampling from distributions optimally tilted toward a given reward model. We extend these techniques by