I Found a Hidden Ratio in Transformers That Predicts Geometric Stability [R]

I have analyzed some decoder transformer models using Lyapuno spectral analysis and found that the ratio of the MLP and attention spectral norms strongly indicates whether a model will eventually collapse to rank-1 or not by the final layers. I found that the spectral ratio is best kept around 0.5-2 for keeping the model stable till the final layers. Github repo: submitted by /u/Otaku_7nfy [link] [comments]