AI RESEARCH

When GPUs Fail Quietly: Observability-Aware Early Warning Beyond Numeric Telemetry

arXiv CS.LG

ArXi:2603.28781v1 Announce Type: cross GPU nodes are central to modern HPC and AI workloads, yet many failures do not manifest as immediate hard faults. While some instabilities emerge gradually as weak thermal or efficiency drift, a significant class occurs abruptly with little or no numeric precursor. In these detachment-class failures, GPUs become unavailable at the driver or interconnect level and the dominant observable signal is structural, including disappearance of device metrics and degradation of monitoring payload integrity.