Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams
arXiv cs.AI
arXiv:2604.18901v1 Announce Type: cross Harmful intent is geometrically recoverable from large language model residual streams: as a linear direction in most layers, and as an angular deviation in layers where projection methods fail. We characterise this geometry through six direction-finding strategies across 12 models spanning four architectural families (Qwen2.5, Qwen3.5, Llama-3.2, Gemma-3) and three alignment variants (base, instruction-tuned, abliterated), under single-turn, English-only evaluation.
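To make the two geometric readouts concrete, here is a minimal sketch on toy vectors. It is illustrative only and is not the paper's method: the abstract does not name its six direction-finding strategies, so the difference-of-means direction used here is an assumed stand-in, and measuring angular deviation against a reference vector is one plausible reading of that term. All function names and the toy activations are hypothetical.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))

def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def difference_of_means_direction(harmful, benign):
    """Unit vector from the benign mean to the harmful mean.

    Assumption: a common direction-finding baseline, not confirmed
    to be one of the six strategies in the paper.
    """
    mu_h = mean_vector(harmful)
    mu_b = mean_vector(benign)
    diff = [h - b for h, b in zip(mu_h, mu_b)]
    length = norm(diff)
    return [x / length for x in diff]

def projection_score(activation, direction):
    """Linear readout: scalar projection onto a unit direction."""
    return dot(activation, direction)

def angular_deviation(activation, reference):
    """Angle in radians between an activation and a reference vector.

    Assumption: one possible definition of 'angular deviation'.
    Unlike the projection, it ignores the activation's magnitude.
    """
    cos = dot(activation, reference) / (norm(activation) * norm(reference))
    cos = max(-1.0, min(1.0, cos))  # guard against rounding outside [-1, 1]
    return math.acos(cos)

# Toy 2-d "residual stream" activations (hypothetical values).
harmful = [[2.0, 0.1], [1.8, -0.1]]
benign = [[0.0, 0.1], [0.2, -0.1]]

direction = difference_of_means_direction(harmful, benign)
for label, rows in (("harmful", harmful), ("benign", benign)):
    scores = [projection_score(r, direction) for r in rows]
    print(label, "projection scores:", scores)
```

The projection score depends on both the angle and the norm of the activation, while the angular measure depends on direction alone. That difference is one reason an angle-based readout could still separate classes in a layer where a projection threshold does not, although the abstract does not say why projection fails in those layers.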