Accelerating Sparse Transformer Inference on GPU

arXiv:2506.06095v4 Announce Type: replace Large language models (LLMs) are widely used because of their strong understanding capabilities. The Transformer is the core component of LLMs, and accelerating it through parallelization has become an active research topic. Mask layers