Support for microsoft/Phi-4-reasoning-vision-15B has been merged into llama.cpp

r/LocalLLaMA

You may remember this model. Phi-4-Reasoning-Vision-15B is a compact open-weight multimodal reasoning model built on the Phi-4-Reasoning language model backbone and the SigLIP-2 vision encoder, using a mid-fusion architecture. In this architecture, the vision encoder first converts images into visual tokens, which are then projected into the language model's embedding space and injected into the pretrained language model. This approach leverages the strengths of both pretrained components while keeping...
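
For anyone who wants to picture the mid-fusion step, here is a rough toy sketch in plain Python. It is not the actual Phi-4 or llama.cpp code. The dimensions, the `<image>` placeholder, the single linear projection, and the helper names (`project`, `inject`, `toy_text_embed`) are all assumptions made up for illustration; the real projector and token layout may differ.

```python
import random

IMAGE_PLACEHOLDER = "<image>"  # hypothetical marker for where the image goes
VISION_DIM, LM_DIM = 4, 6      # toy sizes, not the real model dimensions


def project(visual_tokens, weight):
    """Map each visual token from VISION_DIM to LM_DIM with one linear layer.

    `weight` has shape VISION_DIM x LM_DIM. A single matrix is an assumption;
    real projectors are often small MLPs.
    """
    columns = list(zip(*weight))
    return [[sum(v * w for v, w in zip(tok, col)) for col in columns]
            for tok in visual_tokens]


def inject(text_tokens, text_embed, projected):
    """Build the LM input sequence, splicing projected visual tokens
    in place of the image placeholder."""
    sequence = []
    for tok in text_tokens:
        if tok == IMAGE_PLACEHOLDER:
            sequence.extend(projected)
        else:
            sequence.append(text_embed(tok))
    return sequence


def toy_text_embed(tok):
    """Stand-in for the language model's embedding table."""
    return [float(len(tok))] * LM_DIM


random.seed(0)
weight = [[random.uniform(-1, 1) for _ in range(LM_DIM)]
          for _ in range(VISION_DIM)]
# Pretend the vision encoder turned one image into 3 visual tokens.
visual_tokens = [[random.uniform(-1, 1) for _ in range(VISION_DIM)]
                 for _ in range(3)]

prompt = ["Describe", IMAGE_PLACEHOLDER, "briefly"]
sequence = inject(prompt, toy_text_embed, project(visual_tokens, weight))
print(len(sequence), len(sequence[0]))  # 5 6
```

The point is that after projection the visual tokens have the same width as text embeddings, so the pretrained language model can consume one mixed sequence without knowing which positions came from the image.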