Pseudo-Quantized Actor-Critic Algorithm for Robustness to Noisy Temporal Difference Error

arXiv:2604.01613v1 Announce Type: new In reinforcement learning (RL), temporal difference (TD) errors are widely used to optimize value and policy functions. However, because the TD error is computed by bootstrapping, it tends to be noisy, which can destabilize learning. Heuristics to improve the accuracy of TD errors, such as target networks and ensemble models, have been