Humanline: Online Alignment as Perceptual Loss

arXiv:2509.24207v2 Announce Type: replace Online alignment (e.g., GRPO) is generally more performant than offline alignment (e.g., DPO) -- but why? Drawing on prospect theory from behavioral economics, we propose a human-centric explanation. We prove that online on-policy sampling better approximates the human-perceived distribution of what the model can produce, and PPO/GRPO-style clipping -- originally