AI RESEARCH

Why dynamically routing multi-timescale advantages in PPO causes policy collapse (and a simple decoupled fix) [R]

r/MachineLearning

Hi folks, I’m an undergrad doing some research on temporal credit assignment, and I recently ran into a frustrating issue. Trying to fuse multi-timescale advantages (like γ = 0.5, 0.9, 0.99, 0.999) inside an Actor-Critic architecture usually leads to irreversible policy collapse or really weird local optima. I spent some time diagnosing exactly why this happens, and it boils down to two main optimization pathologies: Surrogate Objective Hacking: When the temporal attention mechanism is exposed to policy gradients, the optimizer just finds a shortcut.