Pessimistic Risk-Aware Policy Learning in Contextual Bandits

ArXi:2605.15620v1 Announce Type: cross We study risk-aware offline policy learning, aiming to learn a decision rule from logged data that is optimal under general risk criteria. This problem is crucial in high-stakes domains where online interaction is infeasible and adverse outcomes must be carefully controlled. However, existing literature on offline contextual bandits either centers on expected-reward criteria or restricts risk considerations to policy evaluation instead of optimization.