Embeddings Just Went Multimodal: What Sentence Transformers 5.4 Means for RAG

Dev.to AI

The latest Sentence Transformers release quietly changes something fundamental. v5.4 adds native multimodal support: same API, same patterns, but now you can encode and compare text, images, audio, and video in a shared embedding space. This isn't a wrapper. It's a direct extension of the embedding workflow that most RAG pipelines already use.

The Shift

Traditional embedding models convert text into fixed-size vectors. You encode a query, encode your documents, and compute cosine similarity.
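To make the encode-then-compare loop concrete, here is a minimal sketch of it in plain Python. It is not the Sentence Transformers API: `toy_encode` is a hypothetical stand-in (bag-of-words counts over a shared vocabulary) used only so the example runs without downloading a model, and `rank_documents` is likewise an illustrative helper. In a real pipeline, an embedding model's encode step would take the place of `toy_encode`; the cosine similarity and ranking would look much the same.

```python
import math
from collections import Counter


def toy_encode(text, vocab):
    """Hypothetical stand-in for an embedding model.

    Returns a fixed-size vector of word counts over `vocab`.
    A real model would return a dense learned vector instead.
    """
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocab]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def rank_documents(query, documents):
    """Encode the query and documents, then sort documents by similarity."""
    # Shared vocabulary so every vector has the same dimensionality.
    vocab = sorted({w for text in [query, *documents] for w in text.lower().split()})
    query_vec = toy_encode(query, vocab)
    scored = [(cosine_similarity(query_vec, toy_encode(doc, vocab)), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


documents = [
    "how to reset a password",
    "image search with embeddings",
    "cooking pasta at home",
]
ranking = rank_documents("reset my password", documents)
best_score, best_doc = ranking[0]
print(best_doc)
```

The point of the sketch is the shape of the workflow: everything downstream of the encoder only sees vectors. That is why swapping a text-only encoder for a multimodal one leaves the retrieval logic untouched.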