Support for microsoft/Phi-4-reasoning-vision-15B has been merged into llama.cpp
r/LocalLLaMA
You may remember this model. Phi-4-Reasoning-Vision-15B is a compact, open-weight multimodal reasoning model built on the Phi-4-Reasoning language-model backbone and the SigLIP-2 vision encoder, using a mid-fusion architecture. In this architecture, the vision encoder first converts images into visual tokens, which are then projected into the language model's embedding space and injected into the pretrained language model. This approach leverages the strengths of both pretrained components while keeping…
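If the mid-fusion idea is new to you, here is a toy sketch of the data flow described above (encode image, project into the LM embedding space, splice into the token sequence). This is not the actual Phi-4 or llama.cpp code. Everything in it is an assumption for illustration: the tiny dimensions, the identity "vision encoder" standing in for SigLIP-2, the single random linear layer used as the projector, and splicing the image embeddings in at a fixed position (real models typically mark where the image goes with a placeholder token).

```python
import random
from typing import List

Vector = List[float]

# Hypothetical toy sizes; the real model's dimensions are not given in the post.
VISION_DIM = 4  # width of each visual token from the stand-in vision encoder
TEXT_DIM = 6    # width of the language model's token embeddings


def matvec(weights: List[Vector], x: Vector) -> Vector:
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def encode_image(patches: List[Vector]) -> List[Vector]:
    """Stand-in for the vision encoder: emits one visual token per image patch.
    A real encoder (SigLIP-2 in the post) is a transformer; here it is identity."""
    return [list(p) for p in patches]


def make_projector(in_dim: int, out_dim: int, seed: int = 0) -> List[Vector]:
    """Hypothetical projector: a single random linear layer (out_dim x in_dim)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def project(visual_tokens: List[Vector], proj: List[Vector]) -> List[Vector]:
    """Map each visual token from VISION_DIM into the LM's TEXT_DIM embedding space."""
    return [matvec(proj, t) for t in visual_tokens]


def inject(text_embeds: List[Vector], image_embeds: List[Vector], position: int) -> List[Vector]:
    """Splice the projected image embeddings into the text embedding sequence."""
    return text_embeds[:position] + image_embeds + text_embeds[position:]


# Usage: 3 image patches, 5 text tokens, image inserted after the first text token.
patches = [[0.1 * i + 0.01 * j for j in range(VISION_DIM)] for i in range(3)]
text_embeds = [[0.0] * TEXT_DIM for _ in range(5)]
proj = make_projector(VISION_DIM, TEXT_DIM)
seq = inject(text_embeds, project(encode_image(patches), proj), position=1)
print(len(seq), len(seq[0]))  # 8 6
```

The point of the projection step is that the language model only ever sees vectors of its own embedding width, so after splicing, the image tokens are processed by the pretrained transformer exactly like text tokens.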