Running the new Qwen3.6-35B-A3B at full context on both a 4090 and GB10 Spark with vLLM and Llama.cpp

r/LocalLLaMA
Generative AI AI Hardware Open Source AI AI Tools

Here is how to run the new Qwen3.6-35B-A3B > At full context on a 4090 - IQ4_XS gguf with llama cpp > At full context on a Spark - FP8 with a tweaked vLLM Here is the docker compose with llama cpp services: llamacpp: container_name: llamacpp-qwen3-6-35b-a3b-iq4xs image: ghcr.io/ggml-org/llama.cpp:server-cuda restart: unless-stopped gpus: all shm_size: "8gb" ipc: host environment: - NVIDIA_VISIBLE_DEVICES=all - NVIDIA_DRIVER_CAPABILITIES=compute,utility command: - -m - /models/Qwen3.6-35B-A3B-UD-IQ4_XS.gguf/Qwen3.6-35B-A3B-UD-IQ4_XS.gguf - --host - 0.0.0.0 - --port - "8000" - --alias.