GAVEL: Towards Rule-Based Safety Through Activation Monitoring

arXiv:2601.19768v3 Announce Type: replace Large language models (LLMs) are increasingly paired with activation-based monitoring to detect and prevent harmful behaviors that may not be apparent at the surface-text level. However, existing activation safety approaches, trained on broad misuse datasets, suffer from poor precision, limited flexibility, and a lack of interpretability. This paper