ViT-Explainer: An Interactive Walkthrough of the Vision Transformer Pipeline

arXiv:2604.02182v1 Announce Type: new

Transformer-based architectures have become the shared backbone of natural language processing and computer vision. However, understanding how these models operate remains challenging, particularly in vision settings, where images are processed as sequences of patch tokens. Existing interpretability tools often focus on isolated components or expert-oriented analysis, leaving a gap in guided, end-to-end understanding of the full inference pipeline.