Learning When Not to Learn: Risk-Sensitive Abstention in Bandits with Unbounded Rewards

arXiv:2510.14884v2 Announce Type: replace-cross In high-stakes AI applications, even a single action can cause irreparable damage. However, nearly all sequential decision-making theory assumes that every error is recoverable (e.g., by bounding rewards). When this assumption fails, standard bandit algorithms that explore aggressively may cause irreparable damage. Some prior work avoids irreparable errors by asking a mentor for help, but a mentor may not always be available.