Qwen3.5:35B-A3B on RTX 5090 32GB - KV cache quantization or lower weight quant to fit parallel requests?
r/LocalLLaMA
Running a small company AI assistant (V&V/RAMS engineering) on Open WebUI + Ollama with this setup:

GPU: RTX 5090 32GB VRAM
Model: Qwen3.5:35b (Q4_K_M) ~27GB
Embedding: nomic-embed-text-v2-moe ~955MB
Context: 32768 tokens
OLLAMA_NUM_PARALLEL: 2

The model is used by 4-5 engineers simultaneously through Open WebUI.

The problem: nvidia-smi shows 31.4GB/32.6GB used, so VRAM is already full with a single request in flight. With OLLAMA_NUM_PARALLEL=2, when two users query at the same time, the second request hangs until the first completes.
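To put rough numbers on the trade-off, here is a back-of-the-envelope sketch of the VRAM budget. The weight, embedding, and total VRAM figures are the ones quoted above. The architecture values (`n_layers`, `n_kv_heads`, `head_dim`) are placeholders, not the real Qwen3.5 configuration, so read them from the model's metadata before trusting the result. The sketch also assumes that every parallel slot gets its own full-context KV cache, uses the ggml block sizes for `q8_0` and `q4_0` as bytes per element, and ignores compute buffers and runtime overhead, so treat the output as an estimate only.

```python
# Rough VRAM budget for the setup above. Architecture values are PLACEHOLDERS.

# Approximate bytes per cached element for each KV cache type.
# q8_0 stores 32 values in 34 bytes, q4_0 stores 32 values in 18 bytes.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_gb(n_layers, n_kv_heads, head_dim, ctx_tokens, parallel, cache_type="f16"):
    """Estimate KV cache size in GB (1e9 bytes), assuming standard attention
    in every layer and one full-context cache per parallel slot."""
    # Factor 2 covers the separate key and value tensors.
    elems = 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * parallel
    return elems * BYTES_PER_ELEM[cache_type] / 1e9


def headroom_gb(total_gb, weights_gb, embed_gb):
    """VRAM left for the KV cache after weights and the embedding model."""
    return total_gb - weights_gb - embed_gb


if __name__ == "__main__":
    # Figures from the post: 32.6 GB total, ~27 GB weights, ~0.955 GB embedding.
    free = headroom_gb(32.6, 27.0, 0.955)
    print(f"headroom for KV cache: {free:.2f} GB")

    # Hypothetical architecture values, for illustration only.
    arch = dict(n_layers=40, n_kv_heads=4, head_dim=128)
    for cache_type in ("f16", "q8_0", "q4_0"):
        need = kv_cache_gb(ctx_tokens=32768, parallel=2, cache_type=cache_type, **arch)
        verdict = "fits" if need <= free else "does not fit"
        print(f"{cache_type}: {need:.2f} GB for 2 slots -> {verdict}")
```

If the estimate shows the f16 cache for two slots exceeding the headroom, the options are the ones in the title: quantize the KV cache (in Ollama this is, as far as I know, controlled by `OLLAMA_KV_CACHE_TYPE` and needs `OLLAMA_FLASH_ATTENTION=1`), use a smaller weight quant, or reduce the context length.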