Trained a Qwen2.5-0.5B-Instruct bf16 model on a Reddit post summarization task with GRPO [P]

So, a few days back I shared a post where I trained a tiny Qwen2.5-0.5B-Instruct model on smoltldr (a Reddit post summarization dataset of 2k rows) to output summaries with a max length of about 64 tokens, using RLVR with GRPO. However, there was a catch! The wandb chart for average response length kept going down and saturated at around 10-15 tokens on average.
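
For anyone who wants the mechanics, here is a minimal sketch of how a length-based verifiable reward feeds into GRPO's group-relative advantages. To be clear, this is an illustration and not necessarily the reward used in the run above: `length_reward`, `group_advantages`, and the sample lengths are all hypothetical, and normalizing with the population standard deviation is an assumption (implementations differ on this).

```python
from statistics import mean, pstdev

TARGET_LEN = 64  # the target summary length mentioned in the post


def length_reward(num_tokens: int, target: int = TARGET_LEN) -> float:
    """Hypothetical reward: 0 at the target length, more negative further away."""
    return -abs(num_tokens - target) / target


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: score each completion relative to its own group.

    Assumption: population std is used for the normalization.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of completions sampled for one prompt, by token count.
lengths = [12, 30, 64, 90]
rewards = [length_reward(n) for n in lengths]
advantages = group_advantages(rewards)

for n, r, a in zip(lengths, rewards, advantages):
    print(f"len={n:3d}  reward={r:+.3f}  advantage={a:+.3f}")
```

Since the advantages are relative within a group, only the ranking and spread of rewards inside each group matter. With a reward shaped like this one, the 64-token completion gets the highest advantage and the 12-token one the lowest, so a steady drift toward 10-15 tokens would suggest that whatever reward is actually in play is not penalizing short outputs enough.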