NEO-unify — A 2B multimodal model with no Vision Encoder, no VAE. Open source coming "hopefully not too long"
r/LocalLLaMA
•
AI Research
SenseTime (the Chinese AI lab) just published details on NEO-unify, a multimodal model that throws out the vision encoder AND the VAE. Just raw pixels in, raw pixels out.