MirrorBench: A Benchmark to Evaluate Conversational User-Proxy Agents for Human-Likeness

arXiv:2601.08118v3 Announce Type: replace-cross Large language models (LLMs) are increasingly used as human simulators, both for evaluating conversational systems and for generating fine-tuning data. However, naive "act-as-a-user" prompting often yields verbose, unrealistic utterances, motivating principled evaluation of *user proxy agents