[Release] MPS-Accelerate — ComfyUI custom node for 22% faster inference on Apple Silicon (M1/M2/M3/M4)

Hey everyone! I built a ComfyUI custom node that accelerates F.linear operations on Apple Silicon by calling Apple's MPSMatrixMultiplication directly, bypassing PyTorch's dispatch overhead.

**Results:**

- Flux.1-Dev (5 steps): 10.6s/it native → 8.3s/it (22% faster)
- Works with Flux, Lumina2, z-image-turbo, and any model running on MPS
- Supports float32, float16, and bfloat16

**How it works:** PyTorch routes every F.linear call through Python → MPSGraph → GPU. MPS-Accelerate short-circuits this path: Python → C++ (pybind11) → MPSMatrixMultiplication.
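For anyone curious what intercepting a linear call can look like on the Python side, here is a rough sketch of the general pattern. It is not the node's actual code. Everything in it is a stand-in: `fake_F` plays the role of a module that exposes `linear`, `fake_fast_linear` plays the role of the C++ extension, and `big_enough` is a made-up rule for when to take the fast path. It runs on plain Python lists, so it needs neither PyTorch nor a Mac.

```python
from types import SimpleNamespace


def reference_linear(x, weight, bias=None):
    """Plain-Python y = x @ weight.T + bias, standing in for the stock linear op."""
    out = [[sum(a * b for a, b in zip(row, w_row)) for w_row in weight] for row in x]
    if bias is not None:
        out = [[v + b for v, b in zip(row, bias)] for row in out]
    return out


def install_fast_linear(namespace, fast_linear, should_accelerate):
    """Swap namespace.linear for a dispatching wrapper; return an undo function."""
    original = namespace.linear

    def patched(x, weight, bias=None):
        # Take the fast path only when the predicate accepts the inputs;
        # otherwise defer to the original implementation unchanged.
        if should_accelerate(x, weight):
            return fast_linear(x, weight, bias)
        return original(x, weight, bias)

    namespace.linear = patched

    def uninstall():
        namespace.linear = original

    return uninstall


# --- Demo with hypothetical stand-ins -------------------------------------
fake_F = SimpleNamespace(linear=reference_linear)  # pretend "F" module
fast_calls = []  # records the row count of every call that took the fast path


def fake_fast_linear(x, weight, bias=None):
    # A real fast path would call into a native extension; this one only
    # logs the call and reuses the reference math.
    fast_calls.append(len(x))
    return reference_linear(x, weight, bias)


def big_enough(x, weight):
    # Made-up rule: only accelerate inputs with at least two rows.
    return len(x) >= 2


uninstall = install_fast_linear(fake_F, fake_fast_linear, big_enough)

W = [[1.0, 0.0], [0.0, 2.0]]
small = fake_F.linear([[1.0, 1.0]], W)                          # falls back
large = fake_F.linear([[1.0, 1.0], [2.0, 3.0]], W, [0.5, 0.5])  # fast path

uninstall()  # restore the original function
```

With real PyTorch, the namespace would presumably be `torch.nn.functional` and the predicate would check something like `x.device.type == "mps"` plus the dtype, but that is an assumption about how such a patch could be wired, not a description of MPS-Accelerate's internals. Keeping the fallback branch is what makes the patch safe for tensors the fast path doesn't handle.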