Beyond Semantic Manipulation: Token-Space Attacks on Reward Models

arXiv:2604.02686v1 Announce Type: cross

Reward models (RMs) are widely used as optimization targets in reinforcement learning from human feedback (RLHF), yet they remain vulnerable to reward hacking. Existing attacks mainly operate in the semantic space, constructing human-readable adversarial outputs that exploit RM biases. In this work, we