AI RESEARCH

[D] Has interpretability research been applied to model training?

r/MachineLearning

A recent X post by Goodfire shows that attention probes can be used to reduce token costs by enabling early CoT exits. This seems to be an interesting use case of attention probes and I am wondering if these techniques have been applied to the models themselves during either pre-