The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm

ArXi:2604.20665v1 Announce Type: new The rapid proliferation of Vision-Language Models (VLMs) is widely celebrated as the dawn of unified multimodal knowledge discovery but its foundation operates on a dangerous, unquestioned axiom: that current VLMs faithfully synthesise multimodal data. We argue they do not. Instead, a profound crisis of trustworthiness underlies the dominant Vision Encoder-Projector-LLM paradigm.