Detection of adversarial intent in Human-AI teams using LLMs

arXiv:2603.20976v1 Large language models (LLMs) are increasingly deployed in human-AI teams as agents for complex tasks such as information retrieval, programming, and decision-making assistance. While these agents' autonomy and contextual knowledge make them useful, they also expose the agents to a broad range of attacks, including data poisoning, prompt injection, and even prompt engineering. Through these attack vectors, malicious actors can manipulate an LLM agent into providing harmful information, potentially leading human teammates to make harmful decisions.