Reflection in the Dark: Exposing and Escaping the Black Box in Reflective Prompt Optimization

arXiv:2603.18388v1 Announce Type: new Automatic prompt optimization (APO) has emerged as a powerful paradigm for improving LLM performance without manual prompt engineering. Reflective APO methods such as GEPA iteratively refine prompts by diagnosing failure cases, but the optimization process remains black-box and label-free, leading to uninterpretable trajectories and systematic failure. We identify and empirically demonstrate four limitations: on GSM8K with a defective seed, GEPA degrades accuracy from 23.81% to 13.50%.