AI RESEARCH

Test-Time Safety Alignment

arXiv CS.AI

arXiv:2604.26167v1 Announce Type: cross Recent work has shown that a model's input word embeddings can serve as effective control variables for steering its behavior toward outputs that satisfy desired properties. However, this has only been demonstrated for pretrained text-completion models on the relatively simple objective of reducing surface-level profanity in short continuations.