I finetuned Qwen3-1.7B to imitate the original Z-Image text encoder. 21% less VRAM

r/StableDiffusion

The first image is from the original pipeline; the second is from the pipeline with the replaced text encoder.

I finetuned Qwen3-1.7B with a small adapter to imitate Qwen3-4B. The idea was simple: recreate the hidden states of Qwen3-4B and pass them to the DiT.

I tested it using fp16:

| Metric | Original (4B) | Student (1.7B) | Savings |
|---|---|---|---|
| Weight VRAM | 20.70 GB | 16.30 GB | 4.40 GB (21%) |
| Peak VRAM | 21.35 GB | 16.76 GB | 4.59 GB (22%) |
| Generation time | 3.9s | 3.5s | - |

I haven't provided a quantized version for this specific model yet.
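For anyone curious what "recreate the hidden states" can look like as a training objective, here is a toy, dependency-free sketch. It is not the actual training code and it loads no Qwen model. It assumes the adapter is a plain linear projection from the student's width to the teacher's width and that the loss is mean squared error; the post states neither. The widths `4` and `6`, the random stand-in data, and all function names are hypothetical placeholders.

```python
import random

def project(weights, hidden):
    """Linear adapter: map one student-width vector to teacher width."""
    return [sum(w * x for w, x in zip(row, hidden)) for row in weights]

def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def sgd_step(weights, hidden, target, lr):
    """One in-place gradient step on mse(project(weights, hidden), target)."""
    pred = project(weights, hidden)
    scale = 2.0 / len(target)
    for i, row in enumerate(weights):
        grad_out = scale * (pred[i] - target[i])
        for j, x in enumerate(hidden):
            row[j] -= lr * grad_out * x

def mean_loss(weights, pairs):
    """Average MSE of the adapter output over all (student, teacher) pairs."""
    return sum(mse(project(weights, h), t) for h, t in pairs) / len(pairs)

random.seed(0)
STUDENT_DIM, TEACHER_DIM = 4, 6  # placeholder widths, not the real model sizes

# Stand-in data (assumption): in a real setup these would be per-token hidden
# states from the student and teacher encoders for the same prompt.
true_map = [[random.uniform(-1, 1) for _ in range(STUDENT_DIM)]
            for _ in range(TEACHER_DIM)]
pairs = []
for _ in range(32):
    h = [random.uniform(-1, 1) for _ in range(STUDENT_DIM)]
    pairs.append((h, project(true_map, h)))

# Train the adapter so its output matches the teacher-width targets.
adapter = [[0.0] * STUDENT_DIM for _ in range(TEACHER_DIM)]
initial_loss = mean_loss(adapter, pairs)
for _ in range(50):
    for h, t in pairs:
        sgd_step(adapter, h, t, lr=0.1)
final_loss = mean_loss(adapter, pairs)
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.6f}")
```

In a real run the student model itself would also be finetuned rather than only the projection, and the matched vectors would be whatever hidden states the DiT consumes from the text encoder.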