DROP: Distributional and Regular Optimism and Pessimism for Reinforcement Learning

arXiv:2410.17473v2 Announce Type: replace In reinforcement learning (RL), the temporal difference (TD) error is known to be related to the firing rate of dopamine neurons. It has been observed that dopamine neurons do not behave uniformly: each responds to the TD error in an optimistic or pessimistic manner, which has been interpreted as a kind of distributional RL. To explain such biological data, a heuristic model has also been