AI RESEARCH
Pairwise matrices for sparse autoencoders: single-feature inspection mislabels causal axes
arXiv cs.LG
arXiv:2605.03160v1 Announce Type: new
The standard sparse-autoencoder (SAE) interpretability protocol labels each feature from its top-activating contexts and validates the label by single-feature steering. We propose the pairwise matrix protocol, which co-varies the steering coefficient with a joint condition, and report three findings that the standard one-corner protocol misses on Qwen3-1.7B-Instruct, replicated on Gemma-2-2B-it.
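To make the contrast between one-corner steering and a pairwise matrix concrete, here is a minimal NumPy sketch. It is not the authors' implementation. The abstract does not define the "joint condition", so this sketch assumes it is the steering coefficient of a second feature. The names `steer`, `pairwise_matrix`, and `readout`, along with the toy directions and the toy readout, are hypothetical and chosen only for illustration.

```python
import numpy as np

def steer(hidden, direction, coeff):
    """Add coeff times the unit-normalized direction to every hidden vector."""
    unit = direction / np.linalg.norm(direction)
    return hidden + coeff * unit

def pairwise_matrix(hidden, dir_a, dir_b, coeffs_a, coeffs_b, readout):
    """Score every (coeff_a, coeff_b) pair; rows vary A, columns vary B.

    Assumption: the "joint condition" is modelled as a second steering
    coefficient. The source abstract does not specify this.
    """
    grid = np.empty((len(coeffs_a), len(coeffs_b)))
    for i, ca in enumerate(coeffs_a):
        for j, cb in enumerate(coeffs_b):
            steered = steer(steer(hidden, dir_a, ca), dir_b, cb)
            grid[i, j] = readout(steered)
    return grid

# Toy setup (hypothetical): two orthogonal feature directions in an 8-d space.
hidden = np.zeros((4, 8))
dir_a = np.eye(8)[0]
dir_b = np.eye(8)[1]

# Toy readout that responds only when both features are active together.
def readout(h):
    return float(np.mean(h[:, 0] * h[:, 1]))

coeffs = [0.0, 1.0, 2.0]
grid = pairwise_matrix(hidden, dir_a, dir_b, coeffs, coeffs, readout)
print(grid)
```

In this toy case the column with the second coefficient fixed at zero, which is all that single-feature steering would inspect, is flat, while the off-axis cells show the interaction. With a real model, `hidden` would be residual-stream activations, the directions would come from SAE decoder vectors, and `readout` would be some behavioral metric; all of these are left as assumptions here.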