GQA-{\mu}P: The maximal parameterization update for grouped query attention

ArXi:2605.15290v1 Announce Type: cross Hyperparameter transfer across model architectures dramatically reduces the amount of compute necessary for tuning large language models (LLMs). The maximal update parameterization ({\mu}P) ensures transfer through principled mathematical analysis but can be challenging to derive for new model architectures. Building on the spectral feature-learning view of Yang (2023a), we make two advances.