RDMA Mac Studio cluster - performance questions beyond generation throughput

r/LocalLLaMA

Jeff Geerling’s RDMA cluster benchmarks showed great generation throughput (31.9 tok/s on 4 nodes for Qwen3 235B), but I have questions about other performance aspects. Anyone with an RDMA cluster setup:

- Prefill speed: Prompt processing at 32K/64K/128K context, single node vs clustered. Does aggregate bandwidth help, or does RDMA overhead eat it?
- Time to first token: Latency before output starts. How does it scale with nodes?
- KV cache: Does the cache persist across nodes between turns, or is it a re-prefill on every query?
- Model loading: Cold-start time for 200B+ models, single vs distributed.
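If it helps keep replies comparable, here is a rough sketch of how I'd derive these numbers from raw timestamps. The `RunTiming` and `summarize` names are my own, not from any particular inference tool, and it assumes you can log when the request was sent, when the first output token arrived, and when the last one arrived. Prefill speed is approximated as prompt tokens divided by time to first token, so it also includes any network or scheduling overhead.

```python
from dataclasses import dataclass


@dataclass
class RunTiming:
    """Timestamps (in seconds) and token counts for one request. Hypothetical helper."""
    prompt_tokens: int
    output_tokens: int
    t_request: float      # when the request was sent
    t_first_token: float  # when the first output token arrived
    t_done: float         # when the last output token arrived


def summarize(run: RunTiming) -> dict:
    """Derive time to first token, approximate prefill speed, and decode speed."""
    ttft = run.t_first_token - run.t_request
    decode_time = run.t_done - run.t_first_token
    # Assumption: TTFT is dominated by prefill, so prompt_tokens / ttft
    # is only an approximation of prompt-processing throughput.
    prefill = run.prompt_tokens / ttft if ttft > 0 else float("nan")
    # The first token is attributed to prefill, so decode covers the rest.
    if decode_time > 0 and run.output_tokens > 1:
        decode = (run.output_tokens - 1) / decode_time
    else:
        decode = float("nan")
    return {"ttft_s": ttft, "prefill_tok_s": prefill, "decode_tok_s": decode}


if __name__ == "__main__":
    # Placeholder values for illustration only, not measurements from any cluster.
    example = RunTiming(prompt_tokens=32768, output_tokens=513,
                        t_request=0.0, t_first_token=64.0, t_done=80.0)
    print(summarize(example))
```

Running the same prompt at 32K/64K/128K on one node and then on the cluster, and posting the three numbers per run, would answer most of the questions above. Comparing TTFT on a second turn of the same conversation against the first turn would also hint at whether the KV cache is being reused.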