AI RESEARCH

Interactionless Inverse Reinforcement Learning: A Data-Centric Framework for Durable Alignment

arXiv cs.LG

arXiv:2602.14844v2 Announce Type: replace

AI alignment is growing in importance, yet many current approaches learn safety behavior by directly modifying policy parameters, entangling normative constraints with the underlying policy. This often yields opaque, difficult-to-edit alignment artifacts and reduces their reuse across models or deployments, a failure mode we term Alignment Waste. We propose Interactionless Inverse Reinforcement Learning, a framework for learning inspectable, editable, and reusable reward artifacts separately from policy optimization. We further.