Qwen 3.5 Vision on vLLM + llama.cpp: 6 things I found out after a few weeks of testing (preprocessing speedups, concurrency)

r/LocalLLaMA

Hi guys, I have been running experiments on Qwen 3.5 Vision pretty hard for a few weeks on vLLM + llama.cpp in Docker. Here are a few things I found out.

1. Long-video OOM almost always comes down to three vLLM flags: `--max-model-len`, `--max-num-batched-tokens`, `--max-num-seqs`. A 1h45m video can hit 18k+ visual tokens and blow past the 16k default before inference even starts. Chunk at the application level (≤300s segments), free the KV cache between chunks, then do a second-pass summary. That lets you run it even on limited local resources.

2. Segment overlap matters. Naive chunking splits events at boundaries.
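
For illustration, here is a minimal sketch of the application-level chunking described above: it only computes segment boundaries of at most 300s, with a small overlap so an event at a boundary appears whole in at least one segment. The function name `segment_bounds` and the 10s default overlap are my own assumptions for the example, not values from the experiments, and the actual video cutting and the vLLM calls are left out.

```python
from typing import List, Tuple


def segment_bounds(
    duration_s: float,
    max_len_s: float = 300.0,  # segment cap from the post (<=300s)
    overlap_s: float = 10.0,   # assumed overlap, tune for your content
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds covering the whole video.

    Each segment is at most max_len_s long, and consecutive segments
    share overlap_s seconds so boundary events are not cut in half.
    """
    if max_len_s <= 0:
        raise ValueError("max_len_s must be positive")
    if not 0 <= overlap_s < max_len_s:
        raise ValueError("overlap_s must be in [0, max_len_s)")

    bounds: List[Tuple[float, float]] = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_len_s, duration_s)
        bounds.append((start, end))
        if end >= duration_s:
            break  # last segment reached the end of the video
        start = end - overlap_s  # step back so segments overlap
    return bounds


if __name__ == "__main__":
    # A 700s clip with the defaults gives three overlapping segments.
    for start, end in segment_bounds(700):
        print(f"{start:.0f}s -> {end:.0f}s")
```

Each (start, end) pair would then be processed as its own request, with the per-segment outputs collected for the second-pass summary.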