EffiMiniVLM: A Compact Dual-Encoder Regression Framework

arXiv:2604.03172v1

Predicting product quality from multimodal item information is critical in cold-start scenarios, where user interaction history is unavailable and predictions must rely on images and textual metadata. However, existing vision-language models typically depend on large architectures, extensive external datasets, or both, resulting in high computational cost.