FP8-RL: A Practical and Stable Low-Precision Stack for LLM Reinforcement Learning

arXiv:2601.18150v2 Announce Type: replace

Reinforcement learning (RL) for large language models (LLMs) is increasingly bottlenecked by rollout (generation), where long output sequence lengths make attention and KV-cache memory dominate end-to-end step time. FP8 offers an attractive lever for accelerating RL by reducing compute cost and memory traffic during rollout, but applying FP8 in RL