Graph-Regularized Sparse Autoencoders for LLM Safety Steering

arXiv:2512.06655v3 Announce Type: replace-cross Sparse autoencoders (SAEs) are increasingly used to extract activation directions for inference-time steering, but their standard sparsity objective treats latent features as independent. This prior can be poorly matched to high-level safety behaviors, where refusal and harmful compliance appear to depend on distributed structure in activation space. We