Mean Mode Screaming: Mean-Variance Split Residuals for 1000-Layer Diffusion Transformers
arXiv cs.LG
arXiv:2605.06169v1 Announce Type: new
Scaling Diffusion Transformers (DiTs) to hundreds of layers [...] To address this, we propose Mean-Variance Split (MV-Split) Residuals, which combine a centered residual update that has its own gain with a leaky trunk-mean replacement. On a 400-layer single-stream DiT, MV-Split prevents the divergent collapse that crashes the unstabilized baseline; it closely tracks the baseline's pre-crash trajectory while remaining substantially better than token-isotropic gating methods such as LayerScale across the full schedule.
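The abstract does not give the update rule, so the following is only a rough sketch of one way a mean-variance split residual could look. It assumes that the "mean" is taken across tokens, that the centered part of the branch output is added with its own gain (`alpha`), and that the trunk mean is blended toward the branch mean with a leak rate (`beta`). The function names, both parameters, and the blending formula are illustrative assumptions, not the authors' definitions. A LayerScale-style update is included for contrast, since it applies a single gain to the mean and centered parts alike.

```python
import numpy as np

def mv_split_residual(x, f, alpha=1.0, beta=0.1):
    """Hypothetical MV-Split-style update (assumed form, not the paper's).

    x: trunk activations, shape (tokens, dim)
    f: residual-branch output, shape (tokens, dim)
    alpha: gain on the centered (per-token deviation) residual  [assumption]
    beta: leak rate for replacing the trunk mean by the branch mean  [assumption]
    """
    mu_x = x.mean(axis=0, keepdims=True)   # trunk mean over tokens
    mu_f = f.mean(axis=0, keepdims=True)   # branch mean over tokens
    # Centered path: ordinary residual addition on the deviations, with its own gain.
    centered = (x - mu_x) + alpha * (f - mu_f)
    # Mean path: leaky replacement instead of unbounded accumulation.
    mean = (1.0 - beta) * mu_x + beta * mu_f
    return centered + mean

def layerscale_residual(x, f, gamma=0.1):
    """LayerScale-style update: one gain applied to the whole branch output."""
    return x + gamma * f

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
f = rng.normal(size=(8, 4)) + 5.0          # branch output with a large shared mean
y = mv_split_residual(x, f, alpha=1.0, beta=0.1)
z = layerscale_residual(x, f, gamma=0.1)
```

Under these assumptions, the token mean of `y` is a convex combination of the trunk and branch means, so it cannot grow without bound across layers, while the per-token deviations still receive the branch update at gain `alpha`. In the LayerScale-style update, shrinking `gamma` to tame the mean also shrinks the centered update by the same factor.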