Improving Visual Recommendation with Vision-Language Model Embeddings

Moving from CNNs’ Low-Level Visual Features to Deep Semantic Embeddings with SigLIP. Image by the author.

Convolutional Neural Networks (CNNs) have important semantic limitations: they capture low- and mid-level visual features (such as edges, textures, and colors) but often fail to encode the high-level, global semantic context that Vision-Language Models (VLMs) provide. A previous story illustrated that recommendations relying solely on ResNet50 embeddings can include semantically irrelevant items.
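To make the idea of embedding-based recommendation concrete, here is a minimal sketch of how items can be ranked by cosine similarity once every image has been turned into a vector. It is an illustration rather than the pipeline used in this story: the `recommend` helper and the toy 2-D vectors are hypothetical stand-ins for real ResNet50 or SigLIP embeddings.

```python
import numpy as np

def recommend(query_idx, embeddings, k=3):
    """Return indices of the k catalog items most similar to the query item.

    Hypothetical helper for illustration; `embeddings` is an
    (n_items, dim) array produced by any image encoder.
    """
    # L2-normalize rows so the dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    scores = unit @ unit[query_idx]
    scores[query_idx] = -np.inf  # never recommend the query item itself
    return np.argsort(-scores)[:k].tolist()

# Toy 2-D vectors standing in for real model outputs (assumed values).
toy = np.array([
    [1.0, 0.0],  # item 0 (query)
    [0.9, 0.1],  # item 1: nearly the same direction as the query
    [0.0, 1.0],  # item 2: orthogonal to the query
    [0.7, 0.7],  # item 3: partially aligned with the query
])

print(recommend(0, toy, k=2))  # -> [1, 3]
```

The ranking step is the same whichever encoder produced the vectors. What changes between a CNN and a VLM is which items end up close together in the embedding space, and therefore which neighbors are returned.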