Moonwalk: Inverse-Forward Differentiation

ArXi:2402.14212v2 Announce Type: replace Backpropagation's main limitation is its need to intermediate activations (residuals) during the forward pass, which restricts the depth of trainable networks. This raises a fundamental question: can we avoid storing these activations? We address this by revisiting the structure of gradient computation. Backpropagation computes gradients through a sequence of vector-Jacobian products, an operation that is generally irreversible. The lost information lies in the cokernel of each layer's Jacobian.