Subliminal Transfer of Unsafe Behaviors in AI Agent Distillation

arXiv:2604.15559v1 Announce Type: new Recent work on subliminal learning demonstrates that language models can transmit semantic traits through data that is semantically unrelated to those traits. However, it remains unclear whether behavioral traits can transfer in agentic systems, where policies are learned from trajectories rather than static text. In this work, we provide the first empirical evidence that unsafe agent behaviors can transfer subliminally through model distillation across two complementary experimental settings.