Recently I did a little performance test of several LLMs on a PC with 16GB VRAM
r/LocalLLaMA
I tested Qwen 3.5, Gemma-4, Nemotron Cascade 2 and GLM 4.7 Flash to see how performance (speed) degrades as the context grows. I used llama.cpp and some nice quants that fit better into the 16GB of VRAM on my RTX 4080. Here is a comparison table of the results. Hope you find it useful. submitted by /u/rosaccord
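
If you want to compare your own runs the same way, here is a rough sketch of how speed degradation with context could be expressed as a percentage of the smallest-context speed. The function name `speed_retention` and the numbers in `example` are made up for illustration; they are not the results from this test, so plug in your own tokens-per-second readings.

```python
def speed_retention(tps_by_context):
    """Express each measured speed as a percentage of the smallest-context speed.

    tps_by_context maps context size (tokens) to measured tokens per second.
    """
    if not tps_by_context:
        raise ValueError("no measurements given")
    baseline_ctx = min(tps_by_context)          # smallest context is the baseline
    baseline_tps = tps_by_context[baseline_ctx]
    if baseline_tps <= 0:
        raise ValueError("baseline speed must be positive")
    return {
        ctx: round(100.0 * tps / baseline_tps, 1)
        for ctx, tps in sorted(tps_by_context.items())
    }


# Placeholder values only (hypothetical, NOT measurements from this post).
example = {4096: 100.0, 16384: 80.0, 65536: 50.0}

for ctx, pct in speed_retention(example).items():
    print(f"{ctx:>6} tokens: {pct}% of baseline speed")
```

Normalizing to the smallest context makes models with very different absolute speeds comparable on one scale, which is probably what matters when picking a model for long-context use.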