Nested Music Transformer: Sequentially Decoding Compound Tokens in Symbolic Music and Audio Generation

arXiv:2408.01180v2 Announce Type: replace-cross

Representing symbolic music with compound tokens, where each token consists of several sub-tokens, each representing a distinct musical feature or attribute, offers the advantage of reducing sequence length. While previous research has validated the efficacy of compound tokens in music sequence modeling, predicting all sub-tokens simultaneously can lead to suboptimal results, as it may not fully capture the interdependencies between them. We.
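
The sequence-length reduction that compound tokens provide can be illustrated with a small sketch. The feature set used here (pitch, duration, velocity) and the names `Note`, `flatten`, and `compound` are assumptions chosen for illustration; they are not taken from the paper, which may use a different set of sub-token features.

```python
from typing import NamedTuple


class Note(NamedTuple):
    """A hypothetical note with three features (an assumed feature set)."""
    pitch: int
    duration: int
    velocity: int


# Three example notes; the values are arbitrary.
notes = [Note(60, 4, 80), Note(64, 4, 72), Note(67, 8, 90)]


def flatten(notes):
    """Flattened encoding: one token per feature value."""
    return [
        (field, value)
        for note in notes
        for field, value in zip(Note._fields, note)
    ]


def compound(notes):
    """Compound encoding: one token per note, holding all sub-tokens."""
    return [tuple(note) for note in notes]


flat_seq = flatten(notes)
compound_seq = compound(notes)

# The compound sequence is shorter by a factor equal to the number of features.
print(len(flat_seq), len(compound_seq))  # prints: 9 3
```

With three features per note, the compound sequence is one third the length of the flattened one. The trade-off is that each position now carries several sub-tokens, so a model has to decide whether to predict them all at once or one after another.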