Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading

arXiv:2410.21316v2 Announce Type: replace-cross Transformers and large language models (LLMs) have seen rapid adoption in all domains. Their sizes have exploded to hundreds of billions of parameters and keep increasing. Under these circumstances, the