AI RESEARCH
Evaluating Model-Free Policy Optimization in Masked-Action Environments via an Exact Blackjack Oracle
arXiv CS.LG
arXiv:2603.18642v1
Infinite-shoe casino blackjack provides a rigorous, exactly verifiable benchmark for discrete stochastic control under dynamically masked actions. Under a fixed Vegas-style ruleset (S17, 3:2 payout, dealer peek, double on any two, double after split, resplit to four), an exact dynamic programming (DP) oracle was derived over 4,600 canonical decision cells. This oracle yielded ground-truth action values, optimal policy labels, and a theoretical expected value (EV) of -0.00161 per hand.
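To illustrate why the infinite-shoe assumption makes exact computation tractable, the following is a minimal sketch of one ingredient such an oracle would plausibly need: the exact distribution of dealer final totals under S17. It is not the paper's implementation. The names `CARD_PROBS`, `dealer_dist`, and the encoding of a bust as the total 22 are assumptions made for this example, and the sketch deliberately ignores the dealer-peek conditioning on blackjack that the stated ruleset includes.

```python
from fractions import Fraction
from functools import lru_cache

# Infinite shoe: every draw is i.i.d. Aces are stored as 1, and the four
# ten-valued ranks (10, J, Q, K) are pooled into the value 10.
CARD_PROBS = {value: Fraction(1, 13) for value in range(1, 10)}
CARD_PROBS[10] = Fraction(4, 13)

BUST = 22  # hypothetical sentinel for any dealer total above 21


@lru_cache(maxsize=None)
def dealer_dist(hard_total, has_ace):
    """Exact distribution of dealer final totals as sorted (total, prob) pairs.

    hard_total counts every ace as 1; has_ace records whether one ace may be
    promoted to 11. The dealer stands on all 17s, including soft 17 (S17).
    Peek conditioning is NOT applied (a simplification of this sketch).
    """
    # Promote one ace to 11 when that does not bust the hand.
    total = hard_total + 10 if has_ace and hard_total + 10 <= 21 else hard_total
    if total > 21:
        return ((BUST, Fraction(1)),)
    if total >= 17:
        return ((total, Fraction(1)),)
    # Otherwise the dealer must hit: average over the next card.
    out = {}
    for card, p in CARD_PROBS.items():
        for final, q in dealer_dist(hard_total + card, has_ace or card == 1):
            out[final] = out.get(final, Fraction(0)) + p * q
    return tuple(sorted(out.items()))


# Distribution starting from a dealer upcard of 6 (no ace yet).
six_up = dict(dealer_dist(6, False))
print(float(six_up[BUST]))
```

Because draws are independent of the cards already dealt, the state is just the pair (hard total, ace flag), so memoized recursion with exact rational arithmetic terminates quickly and gives probabilities that sum to exactly 1. A full oracle would layer player action values (stand, hit, double, split, with masking of unavailable actions) on top of a distribution like this one.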