Theory of Mind and Self-Attributions of Mentality are Dissociable in LLMs

arXiv:2603.28925v1 Announce Type: cross Safety fine-tuning in Large Language Models (LLMs) seeks to suppress potentially harmful forms of mind-attribution such as models asserting their own consciousness or claiming to experience emotions. We investigate whether suppressing mind-attribution tendencies degrades intimately related socio-cognitive abilities such as Theory of Mind (ToM