VLM4Rec: Multimodal Semantic Representation for Recommendation with Large Vision-Language Models

arXiv:2603.12625v1 Announce Type: cross Multimodal recommendation is commonly framed as a feature fusion problem, where textual and visual signals are combined to better model user preference. However, the effectiveness of multimodal recommendation may depend not only on how modalities are fused, but also on whether item content is represented in a semantic space aligned with preference matching.