A Thousand Words - Image Captioning (Vision Language Model) interface

r/StableDiffusion

I've spent a lot of time writing various batch-processing scripts for different VLMs in the past (GitHub repo search). Instead of writing yet another one, I decided to spend way too much time on a GUI that unifies all, or at least most, of them: a hub tool for running many different image-to-text models in one place. It lets you switch between models, use preset prompts, do some pre/post editing, and even batch multiple models in sequence. It all runs in one GUI, but also as a server / API so you can call it from other tools.
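To give a rough idea of what "batch multiple models in sequence" means, here is a minimal Python sketch. It is not the tool's actual code or API: `run_models_in_sequence`, `fake_tagger`, and `fake_describer` are hypothetical names made up for illustration, and the dummy captioners just return strings instead of loading a real model.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in: a "captioner" takes an image path and a prompt
# and returns caption text. A real one would wrap an actual VLM.
Captioner = Callable[[str, str], str]


def run_models_in_sequence(
    image_paths: List[str],
    models: Dict[str, Captioner],
    prompt: str,
    post_edit: Callable[[str], str] = str.strip,
) -> Dict[str, Dict[str, str]]:
    """Caption every image with each model, one model at a time."""
    results: Dict[str, Dict[str, str]] = {}
    for name, caption in models.items():
        # Finish the whole batch with one model before moving to the next,
        # so only one model would need to be loaded at any moment.
        results[name] = {
            path: post_edit(caption(path, prompt)) for path in image_paths
        }
    return results


# Dummy captioners (assumptions for this sketch, not real models).
def fake_tagger(path: str, prompt: str) -> str:
    return f"  tags for {path}  "


def fake_describer(path: str, prompt: str) -> str:
    return f"{prompt}: description of {path}\n"


captions = run_models_in_sequence(
    ["a.png", "b.png"],
    {"tagger": fake_tagger, "describer": fake_describer},
    prompt="Describe the image",
)
```

The result is keyed by model name, then by image path, and the `post_edit` step (here just whitespace stripping) stands in for the post-editing mentioned above.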