I trained an anime image model in 2 days from scratch on 1 local GPU

Using a combination of recent papers, I trained a 250M-parameter text-to-image anime model in 2 days from scratch (not a finetune of an existing diffusion model) on 1 local RTX Pro 6000 GPU.

VAE: trained in 8 hours, using DINOv3 as the encoder.
Diffusion model: trained in 42 hours.
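
As a quick sanity check on the "2 days" figure, the two stage times quoted above can be added up. This is only a sketch: it assumes the VAE and diffusion runs happened back to back on the same GPU with no overlap, and the variable names are illustrative, not taken from any training script.

```python
# Stage times as stated in the post, in hours.
VAE_HOURS = 8
DIFFUSION_HOURS = 42

# Assumption: the stages ran sequentially, so wall-clock time is the sum.
total_hours = VAE_HOURS + DIFFUSION_HOURS
total_days = total_hours / 24

print(f"total: {total_hours} h = {total_days:.2f} days")
```

Under that assumption the total is 50 GPU-hours, just over two days.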