AI RESEARCH

Does "Do Differentiable Simulators Give Better Policy Gradients?" Give Better Policy Gradients?

arXiv CS.LG

arXiv:2604.18161v1 Announce Type: new In policy gradient reinforcement learning, access to a differentiable model enables 1st-order gradient estimation, which accelerates learning compared to relying solely on derivative-free 0th-order estimators. However, discontinuous dynamics bias 1st-order estimators and undermine their effectiveness. Prior work addressed this bias by constructing a confidence interval around the REINFORCE 0th-order gradient estimator and using its bounds to detect discontinuities.
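To make the bias concrete, here is a minimal sketch on a toy problem that is not taken from the paper: a step-function cost smoothed by Gaussian noise. The function names (`step`, `step_grad`, `compare_estimators`), the baseline-subtracted 0th-order estimator, and the normal-approximation interval are all assumptions chosen for illustration. They show the general idea of checking a 1st-order estimate against a 0th-order confidence interval, and should not be read as the exact procedure of the prior work.

```python
import math
import numpy as np


def step(x):
    """Discontinuous toy cost: 0 for x < 0, 1 for x >= 0."""
    return (np.asarray(x) >= 0.0).astype(float)


def step_grad(x):
    """Pointwise derivative of the step, which is zero almost everywhere."""
    return np.zeros_like(np.asarray(x, dtype=float))


def compare_estimators(theta, sigma=1.0, n=100_000, z=1.96, seed=0):
    """Estimate d/dtheta E[step(theta + w)] with w ~ N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(0.0, sigma, size=n)

    # 1st-order: average the pointwise derivative along sampled points.
    first_order = float(step_grad(theta + w).mean())

    # 0th-order (REINFORCE-style) with f(theta) as a baseline.
    samples = (step(theta + w) - step(theta)) * w / sigma**2
    zeroth_order = float(samples.mean())

    # Normal-approximation confidence interval around the 0th-order mean
    # (an illustrative assumption, not the interval used in prior work).
    half_width = z * float(samples.std(ddof=1)) / math.sqrt(n)

    # Closed form for this toy problem: the Gaussian density at theta.
    exact = math.exp(-theta**2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

    return {
        "first_order": first_order,
        "zeroth_order": zeroth_order,
        "half_width": half_width,
        "exact": exact,
        "first_order_in_interval": bool(abs(first_order - zeroth_order) <= half_width),
    }


if __name__ == "__main__":
    for name, value in compare_estimators(theta=0.0).items():
        print(f"{name}: {value}")
```

In this toy setting the 1st-order estimate is identically zero because the step has zero derivative wherever it is differentiable, while the 0th-order estimate tracks the nonzero gradient of the smoothed objective. The 1st-order value falling outside the interval is the kind of signal such a check relies on to flag a discontinuity.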