What If LLMs Learn Better from Language Than from Rewards?

Rethinking LLM Optimization through TextGrad, MIPRO, and GEPA

For the past couple of years, reinforcement learning (RL) has been the dominant paradigm for adapting large language models (LLMs) to downstream tasks. Methods like PPO and GRPO treat model behavior as a policy and rely on scalar rewards to guide learning. But there’s a mismatch hiding beneath this success: LLMs operate in language, yet we reward them with numbers.