Running Minimax 2.7 at 100k context on strix halo
r/LocalLLaMA
•
Generative AI
Open Source AI
Just wanted to share because it took me a lot of tweaking to get here: llama-server -hf unsloth/MiniMax-M2.7-GGUF:UD-IQ3_XXS --temp 1.0 --top-k 40 --top-p 0.95 --host 0.0.0.0 --port 8080 -c 100000 -fa on -ngl 999 --no-context-shift -fit off --no-mmap -np 2 --k-unified --cache-ram 0 -b 1024 -ub 1024 --cache-reuse 256 Reasoning behind the various options --no-context-shift I want to know when I run out of context instead of silently corrupting stuff --no-mmap Recommended by Donato -np 2 Retain context for up to two concurrent sessions --k-unified Make the two session share the same cache to.