Reference-guided Policy Optimization for Molecular Optimization via LLM Reasoning

arXiv:2603.05900v1 Announce Type: cross Large language models (LLMs) benefit substantially from supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR) in reasoning tasks. However, these recipes perform poorly in instruction-based molecular optimization, where each data point typically provides only a single optimized reference molecule and no step-by-step optimization trajectory.