Grafting vision onto text models for fun and profit.

So, as we know, llama.cpp keeps the vision (or other multimedia) part separate from the main weights. Conversely, capabilities a model was trained with are sometimes stripped out at release. What if there was a way to put them back?

Mistral has now released both the Pixtral and Medium vision encoders. The tokenizers of past models still contain the relevant parts, such as the image token:

    "10": {
      "content": "[IMG]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },

Let's take Behemoth-X, because I rather like that model.
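Before trying any grafting, it is worth confirming that the image token really is still in the tokenizer of whichever model you pick. Here is a minimal sketch, assuming the entry shown above sits under an `added_tokens_decoder` key in the model's `tokenizer_config.json` (the usual Hugging Face layout, but check your own files). The names `find_special_tokens` and `sample` are made up for illustration, and the sample data is just the snippet from above.

```python
import json


def find_special_tokens(tokenizer_config: dict, wanted: set) -> dict:
    """Return {token_text: token_id} for each wanted special token present.

    Assumes the config maps string token ids to entries under
    "added_tokens_decoder"; that key name is an assumption about the file layout.
    """
    decoder = tokenizer_config.get("added_tokens_decoder", {})
    found = {}
    for token_id, entry in decoder.items():
        # Only count tokens that are flagged as special.
        if entry.get("content") in wanted and entry.get("special"):
            found[entry["content"]] = int(token_id)
    return found


# Inline sample built from the snippet in the post. For a real model you
# would json.load() its tokenizer_config.json instead.
sample = json.loads("""
{
  "added_tokens_decoder": {
    "10": {
      "content": "[IMG]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  }
}
""")

print(find_special_tokens(sample, {"[IMG]"}))
```

If the returned dict is empty for your model, the token was removed from the tokenizer as well, and a vision encoder would have nothing to hook into.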