I Ran Kotlin HumanEval on 11 Local LLMs. An 8GB Model Beat Several 30B Models

TLDR: I ran JetBrains' Kotlin HumanEval on 11 local models, including some small ones that fit on a 16 GB VRAM GPU. Here are the results.

pass / pass:
GPT-OSS 20B: 85% / 95%
Qwen3.5-35B-a3b: 77% / 86%
EssentialAI RNJ-1: 75% / 81% ← 8.8 GB file size
Seed-OSS-36B: 74% / 81%
GLM 4.7 Flash: 68% / 78%

A few things I found interesting:
GPT-OSS 20B still dominates at 85% pass, despite being one of the smaller models by file size (12 GB).
EssentialAI RNJ-1 at 8.8 GB took third place overall, beating models 2-3x its size.
Qwen jumped 18 points in seven months.

Happy to answer questions about the setup.