StressWeb: A Diagnostic Benchmark for Web Agent Robustness under Realistic Interaction Variability

arXiv:2604.16385v1 Announce Type: cross Large language model-based web agents have demonstrated strong performance on realistic web interaction tasks. However, existing evaluations are predominantly conducted under relatively stable and well-behaved interaction conditions, which may overestimate agent robustness. High task success in such idealized settings does not necessarily reflect performance under realistic web interaction. To address this limitation, we