Test-Time Safety Alignment

arXiv:2604.26167v1 Announce Type: cross Recent work has shown that a model's input word embeddings can serve as effective control variables for steering its behavior toward outputs that satisfy desired properties. However, this has only been demonstrated for pretrained text-completion models on the relatively simple objective of reducing surface-level profanity in short continuations.