AI RESEARCH

Do LLMs Follow Their Own Rules? A Reflexive Audit of Self-Stated Safety Policies

arXiv cs.AI

arXiv:2604.09189v1 Announce Type: cross

LLMs internalize safety policies through RLHF, yet these policies are never formally specified and remain difficult to inspect. Existing benchmarks evaluate models against external standards but do not measure whether models understand and enforce their own stated boundaries. We