The Impact of Off-Policy Training Data on Probe Generalisation

arXiv:2511.17408v4 Announce Type: replace-cross Probing has emerged as a promising method for monitoring large language models (LLMs), enabling cheap inference-time detection of concerning behaviours. However, natural examples of many behaviours are rare, forcing researchers to rely on synthetic or off-policy LLM responses for