Do Activation Verbalization Methods Convey Privileged Information?

arXiv:2509.13316v4 Announce Type: replace-cross Recent interpretability methods propose translating LLM internal representations into natural language descriptions using a second, verbalizer LLM. This is intended to illuminate how the target model represents and operates on inputs.