From Feelings to Metrics: Understanding and Formalizing How Users Vibe-Test LLMs

arXiv:2604.14137v1 Announce Type: cross Evaluating LLMs is challenging because benchmark scores often fail to capture models' real-world usefulness. Instead, users frequently rely on "vibe-testing": informal, experience-based evaluation, such as comparing models on coding tasks drawn from their own workflow. Although prevalent, vibe-testing is usually too ad hoc and unstructured to analyze or reproduce at scale. In this work, we study how vibe-testing works in practice and then formalize it to support systematic analysis.