Instruction Tuning Changes How Upstream State Conditions Late Readout: A Cross-Patching Diagnostic

arXiv:2605.07284v1 Announce Type: new

Recent interpretability work has identified model-internal handles on post-trained behavior, including refusal directions, assistant/persona axes, and sparse chat-tuning features. These results localize where behaviors can be read out or controlled, often in middle-to-late layers. We ask how earlier computation and the late stack cooperate to turn those differences into next-token margins. To test this, we