Bench 8xMI50 MiniMax M2.7 AWQ @ 64 tok/s peak (vllm-gfx906-mobydick)

r/LocalLLaMA

Inference engine used (vllm fork): vllm-gfx906-mobydick

Hugging Face quant used: cyankiwi/MiniMax-M2.7-AWQ-4bit

Relevant commands to run:

    docker run -it --name vllm-gfx906-mobydick-mixa3607 \
      -v ~/llm/models:/models --network host \
      --device=/dev/kfd --device=/dev/dri --group-add video \
      --group-add $(getent group render | cut -d: -f3) --ipc=host \
      mixa3607/vllm-gfx906:0.19.1-rocm-7.2.1-aiinfos-20260405173349

    FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" OMP_NUM_THREADS=4 VLLM_LOGGING_LEVEL=DEBUG NCCL_DEBUG=INFO vllm serve \
      /llm/models/MiniMax-M2.7-AWQ-4bit \
      --served-model-name MiniMax-M2.7-AWQ-4bit
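Once the server is up, a quick smoke test against vLLM's OpenAI-compatible API might look like the sketch below. This is not part of the original post. It assumes the server listens on vLLM's default port 8000 (no port flag is shown above) and is reachable on localhost because of `--network host`. `build_payload` and `smoke_test` are hypothetical helper names, and the prompt is not JSON-escaped, so keep it free of quotes and backslashes.

```shell
# Assumption: vLLM default port 8000, reachable on localhost via --network host.
BASE_URL="${BASE_URL:-http://localhost:8000}"
# Must match the value passed to --served-model-name.
MODEL="MiniMax-M2.7-AWQ-4bit"

# Hypothetical helper: build a minimal /v1/completions request body.
# $1 = prompt (plain text, no quotes or backslashes), $2 = max tokens (default 64).
build_payload() {
  printf '{"model": "%s", "prompt": "%s", "max_tokens": %d}' "$MODEL" "$1" "${2:-64}"
}

# Hypothetical helper: send one completion request and print the raw JSON reply.
smoke_test() {
  curl -s "$BASE_URL/v1/completions" \
    -H "Content-Type: application/json" \
    -d "$(build_payload "$1" "${2:-64}")"
}
```

Calling `smoke_test "Hello" 32` should return a JSON body whose `usage` field reports prompt and completion token counts; `curl -s "$BASE_URL/v1/models"` lists the served model name. A single request like this only confirms the server responds and says nothing about peak throughput.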