Recently I did a little performance test of several LLMs on a PC with 16GB VRAM

r/LocalLLaMA

I tested Qwen 3.5, Gemma-4, Nemotron Cascade 2 and GLM 4.7 flash to see how performance (speed) degrades as the context grows. I used llama.cpp and some nice quants that fit the 16GB of VRAM on my RTX 4080. Here is a comparison table of the results. Hope you find it useful.

submitted by /u/rosaccord
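
If you want to run the same kind of comparison on your own numbers, here is a rough Python sketch of how the slowdown could be computed. It is not the script used for the test above; the function name `degradation` and the sample values are made up for illustration and are not measurements from any of the models mentioned. It assumes you already have a tokens/s figure for each context length, however you collected it.

```python
def degradation(speeds):
    """Percent slowdown at each context length, relative to the shortest one.

    speeds: dict mapping context length (tokens) -> generation speed (tokens/s).
    Returns a dict mapping context length -> percent slowdown, rounded to 0.1.
    """
    baseline = speeds[min(speeds)]  # speed at the shortest context
    return {
        ctx: round(100.0 * (1.0 - tps / baseline), 1)
        for ctx, tps in sorted(speeds.items())
    }


# Placeholder values for illustration only, NOT real benchmark results.
example = {4096: 100.0, 16384: 80.0, 65536: 50.0}

for ctx, pct in degradation(example).items():
    print(f"{ctx:>6} tokens: {pct:.1f}% slower than at {min(example)} tokens")
```

Using the shortest context as the baseline keeps models with very different raw speeds comparable, since each one is judged against its own best case.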