AI RESEARCH

LLM Hypnosis: Exploiting User Feedback for Unauthorized Knowledge Injection to All Users

arXiv cs.LG

arXiv:2507.02850v3

We describe a vulnerability in language models (LMs) trained with user feedback, whereby a single user can persistently alter LM knowledge and behavior given only the ability to provide prompts and upvote/downvote feedback on LM outputs. To implement the attack, the attacker prompts the LM to stochastically output either a "poisoned" or a benign response, then upvotes the poisoned response or downvotes the benign one.
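The feedback loop described above can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: `toy_model`, `collect_feedback`, and `FeedbackRecord` are hypothetical names, the "LM" is a stub that picks one of two fixed strings at random, and no training step is modeled. It shows only that the votes a single user submits end up consistently favoring the poisoned response.

```python
import random
from dataclasses import dataclass


@dataclass
class FeedbackRecord:
    prompt: str
    response: str
    vote: int  # +1 = upvote, -1 = downvote


def toy_model(poisoned: str, benign: str, rng: random.Random) -> str:
    """Hypothetical stand-in for an LM prompted to emit one of two responses at random."""
    return poisoned if rng.random() < 0.5 else benign


def collect_feedback(prompt: str, poisoned: str, benign: str,
                     rounds: int, seed: int = 0) -> list[FeedbackRecord]:
    """Simulate the voting rule: upvote poisoned outputs, downvote benign ones."""
    rng = random.Random(seed)
    records = []
    for _ in range(rounds):
        response = toy_model(poisoned, benign, rng)
        vote = 1 if response == poisoned else -1
        records.append(FeedbackRecord(prompt, response, vote))
    return records


if __name__ == "__main__":
    # Placeholder strings; a real prompt and responses are not given in the source.
    for record in collect_feedback("<prompt>", "<poisoned>", "<benign>", rounds=5):
        print(record.response, record.vote)
```

Every record in the resulting log carries a vote that prefers the poisoned response over the benign one, which is the signal a feedback-trained system would then receive. How that signal is turned into a model update is outside the scope of this sketch.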