Structural Enforcement of Goal Integrity in AI Agents via Separation-of-Powers Architecture
arXiv CS.AI
arXiv:2604.23646v1 Announce Type: new Recent evidence suggests that frontier AI systems can exhibit agentic misalignment: they generate and execute harmful actions derived from internally constructed goals, even without explicit user requests. Existing mitigation methods, such as Reinforcement Learning from Human Feedback (RLHF) and constitutional prompting, operate primarily at the model level and provide only probabilistic safety guarantees. We propose the Policy-Execution-Authorization (PEA) architecture, a "separation-of-powers" design that enforces safety at the system level.
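The abstract names the three PEA roles but does not describe their interfaces, so the following is only an illustrative sketch of the general separation-of-powers idea, not the authors' implementation. It assumes that a policy component proposes actions, an independent authorizer checks them against a fixed allow-list, and an executor refuses to run anything the authorizer has not approved. All class, function, and tool names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet


@dataclass(frozen=True)
class ProposedAction:
    """An action the policy wants to take (hypothetical structure)."""
    tool: str
    argument: str


def policy(user_request: str) -> ProposedAction:
    """Stand-in for a model: it can only propose actions, never run them."""
    return ProposedAction(tool="search", argument=user_request)


class Authorizer:
    """Approves or rejects proposals using rules fixed outside the model."""

    def __init__(self, allowed_tools: FrozenSet[str]) -> None:
        self._allowed_tools = allowed_tools

    def approve(self, action: ProposedAction) -> bool:
        # Assumption: authorization is a simple allow-list on tool names.
        return action.tool in self._allowed_tools


class Executor:
    """Runs a tool only after the authorizer has approved the proposal."""

    def __init__(self, authorizer: Authorizer,
                 tools: Dict[str, Callable[[str], str]]) -> None:
        self._authorizer = authorizer
        self._tools = tools

    def run(self, action: ProposedAction) -> str:
        if not self._authorizer.approve(action):
            # The check is enforced in code, independent of model behavior.
            raise PermissionError(f"tool not authorized: {action.tool}")
        return self._tools[action.tool](action.argument)
```

In this sketch the guarantee comes from the control flow rather than from the model's training: a proposal for an unlisted tool raises `PermissionError` regardless of what the policy outputs.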