Sparse Growing Transformer: Training-Time Sparse Depth Allocation via Progressive Attention Looping

arXiv:2603.23998v1 Announce Type: new Existing approaches to increasing the effective depth of Transformers predominantly rely on parameter reuse, extending computation through recursive execution. Under this paradigm, the network structure remains static along the