PixelDiT: Pixel Diffusion Transformers for Image Generation, 1.3B, no VAE

PixelDiT is a 1.3B-parameter text-to-image model by NVIDIA with image editing capabilities.

Key features:

- VAE-free
- Dual-level architecture: Patch-level DiT + Pixel-level DiT
- MM-DiT text-image fusion: joint attention between text and image tokens
- Text encoder: Gemma-2-2B-IT
- Multi-aspect-ratio: supports various aspect ratios at 1024px

Relevant links:

- [Project page]
- [Paper]
- [Github page]
- [HuggingFace page (diffusers)]
- [ComfyUI version]
- [Workflow] (in first comment)

(There was an earlier post about this model with a few upvotes.