Semantic video search using local Qwen3-VL embedding, no API, no transcription

r/LocalLLaMA
Generative AI AI Hardware

I've been experimenting with Qwen3-VL-Embedding for native video search, embedding raw video directly into a vector space alongside text queries. No transcription, no frame captioning, no intermediate text. You just search with natural language and it matches against video clips. The surprising part: the 8B model produces genuinely usable results running fully local. Tested on Apple Silicon (MPS) and CUDA. The 8B model needs ~18GB RAM, the 2B runs on ~6GB. I built a CLI tool around this ( SentrySearch ) that indexes footage into ChromaDB, searches it, and auto-trims the matching clip.