AI RESEARCH

StepFly: Agentic Troubleshooting Guide Automation for Incident Diagnosis

arXiv CS.AI

arXiv:2510.10074v2

Effective incident management in large-scale IT systems relies on troubleshooting guides (TSGs), but executing them manually is slow and error-prone. Recent advances in LLMs offer promise for automating incident management tasks, yet existing LLM-based solutions lack specialized support for several key challenges: managing TSG quality issues, interpreting complex control flow, handling data-intensive queries, and exploiting execution parallelism.