NEO-unify — A 2B multimodal model with no Vision Encoder, no VAE. Open source coming "hopefully not too long"

r/LocalLLaMA
AI Research

SenseTime (the Chinese AI lab) just published details on NEO-unify, a multimodal model that throws out the vision encoder AND the VAE. Just raw pixels in, raw pixels out.