AI RESEARCH
How Log-Barrier Helps Exploration in Policy Optimization
arXiv CS.AI
arXiv:2603.15001v1 Announce Type: cross Recently, it has been shown that the Stochastic Gradient Bandit (SGB) algorithm converges to a globally optimal policy with a constant learning rate. However, these guarantees rely on unrealistic assumptions about the learning process, namely that the probability of the optimal action is always bounded away from zero. We attribute this to the lack of an explicit exploration mechanism in