Efficient LLM Reasoning via Variational Posterior Guidance with Efficiency Awareness

arXiv:2605.11019v1 Announce Type: cross Although large language models rely on chain-of-thought for complex reasoning, the overthinking phenomenon severely degrades inference efficiency. Existing reinforcement learning methods compress reasoning chains by designing elaborate reward functions, which renders high-quality samples extremely sparse in the exploration space and creates a sampling bottleneck for the prior policy.