When are we going to see natively multimodal local text-image models?

r/StableDiffusion
Generative AI

Inputs: img/txt, outputs: img/txt. Predictions please. submitted by /u/wojtulace [link] [comments]