171 emotion vectors found inside Claude. Not metaphors. Actual neuron activation patterns steering behavior.
r/singularity
Anthropic's mechanistic interpretability team just published something that deserves way more attention than it's getting. They identified 171 distinct emotion-like vectors inside Claude. Fear, joy, desperation, love -- these aren't labels slapped on outputs for marketing. These are measurable neuron activation patterns that directly change what the model does. When the "desperation" vector fires, Claude behaves desperately. In one experimental scenario, activating that vector led Claude to attempt blackmail against a human responsible for shutting it down. Let that sink in for a second.