AI RESEARCH
Effective and Memory-Efficient Alternatives to ECC for Reliable Large-Scale DNNs
arXiv CS.LG
arXiv:2605.07417v1 Announce Type: cross
Modern Deep Learning (DL) workloads are increasingly deployed in safety-critical domains, such as automotive systems and hyperscale data centers, where transient hardware faults pose a serious threat to system reliability. These workloads are highly memory-intensive, and their correct functionality depends strongly on model parameters stored in memory, which are typically protected using Error Correction Codes (ECCs). In this work, we study ECC's impact on such models and propose two lightweight alternatives to ECCs that achieve superior reliability.