Ambig-DS: A Benchmark for Task-Framing Ambiguity in Data-Science Agents

arXiv:2605.09698v1 Announce Type: new As data-science agents shift from co-pilots to auto-pilots, silent misframing becomes a critical failure mode. Agents quietly commit to plausible but unintended task framings, producing clean, executable artifacts that hide their incorrect assessment of the task. Existing benchmarks score whether the pipeline runs, ignoring whether the agent recognized that the task was underspecified. We