[Release] Carnice-9b-W8A16-AWQ – AWQ Quantization Optimized for vLLM + Marlin on Ampere GPUs (Single-GPU)
r/LocalLLaMA
•
Machine Learning
Generative AI
MLOps
AI Hardware
Open Source AI
Hey r/LocalLLaMA, I am releasing my first model quantization: an 8-bit symmetric AWQ (W8A16) of kai-os/Carnice-9b, specifically optimized for Ampere GPUs (RTX 30-series) using vLLM with the Marlin kernel on a single-GPU inference setup. kai-os/Carnice-9b is a specialized fine-tune of Qwen/Qwen3.5-9B that removes the visual components and adopts the Qwen3_5ForCausalLM architecture for pure text/agentic use (Hermes Agent harness). This architecture is not yet natively ed by vLLM (pending.