Beyond Linear Probes: Dynamic Safety Monitoring for Language Models

arXiv:2509.26238v4 Announce Type: replace

Monitoring the activations of large language models (LLMs) is an effective way to detect harmful requests before they lead to unsafe outputs. However, traditional safety monitors often require the same amount of compute for every query. This creates a trade-off: expensive monitors waste resources on easy inputs, while cheap ones risk missing subtle cases. We argue that safety monitors should be flexible: costs should rise only when inputs are difficult to assess, or when compute is available. To achieve this, we