An Introduction to Exemplar Partitioning for Mechanistic Interpretability
LessWrong AI
Most of what we currently call "feature discovery" in language models is wrapped up in dictionary-learning methods like sparse autoencoders (SAEs) - which work, and which have been scaled to millions of features on frontier-scale models, but which bundle two distinct commitments into a single