Rethinking Generative Image Pretraining: How Far Are We From Scaling Up Next-Pixel Prediction?

ArXi:2511.08704v2 Announce Type: replace-cross This paper investigates the scaling properties of autoregressive next-pixel prediction, a simple, end-to-end yet under-explored framework for unified vision models. Starting with images at resolutions of 32x32, we train a family of Transformers using IsoFlops profiles across compute budgets up to 7e19 FLOPs and evaluate three distinct target metrics: next-pixel prediction objective, ImageNet classification accuracy, and generation-based completion measured by Fr'echet Distance. First, optimal scaling strategy is critically task-dependent.