Extension idea: llama-server with custom samplers

r/LocalLLaMA

Just an idea and a prototype (made by Qwen3.6-27B-UD-Q6_K_XL via OpenCode) for letting users add custom sampling logic to llama-server without maintaining their own fork and without writing a wrapper that reimplements everything llama-server can do. Included is an example extension that detects and breaks one kind of loop I've commonly seen with heavily quantized models, where they get stuck repeating the same 1-3 tokens.
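
To make the loop-breaking idea concrete, here is a minimal Python sketch of one way such a check could work. It is not the prototype's code and does not use any llama.cpp or llama-server API: the function names, the `min_repeats` threshold, and the choice to ban the loop-continuing token by setting its logit to negative infinity are all assumptions made for illustration.

```python
import math
from typing import List, Optional, Sequence


def detect_short_loop(
    tokens: Sequence[int],
    max_period: int = 3,   # longest cycle to look for (the "1-3 tokens" case)
    min_repeats: int = 4,  # hypothetical threshold: cycles needed to call it a loop
) -> Optional[int]:
    """Return the smallest period p in 1..max_period such that the last
    p * min_repeats tokens are a single p-token pattern repeated, else None."""
    for period in range(1, max_period + 1):
        window = period * min_repeats
        if len(tokens) < window:
            continue
        tail = tokens[-window:]
        # Every token in the tail must match the token at the same
        # position inside the first cycle of the tail.
        if all(tail[i] == tail[i % period] for i in range(window)):
            return period
    return None


def break_loop(logits: List[float], tokens: Sequence[int], period: int) -> List[float]:
    """Return a copy of logits with the token that would continue the
    detected cycle banned, so the sampler has to pick something else."""
    out = list(logits)
    # In a cycle of length `period`, the next token would repeat the one
    # generated `period` steps ago.
    out[tokens[-period]] = -math.inf
    return out


# Hypothetical usage: token IDs index directly into a 10-entry logits list.
history = [5, 7, 9, 1, 2, 1, 2, 1, 2, 1, 2]
period = detect_short_loop(history)
if period is not None:
    adjusted = break_loop([0.0] * 10, history, period)
```

Banning a single token is the bluntest possible fix. A real extension might prefer a softer penalty, or a temporary temperature bump, so that legitimate repetition such as long runs of whitespace or table separators is not broken.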