AI RESEARCH

DWDP: Distributed Weight Data Parallelism for High-Performance LLM Inference on NVL72

arXiv cs.AI

arXiv:2604.01621v2

Large language model (LLM) inference increasingly depends on multi-GPU execution, yet existing inference parallelization strategies require layer-wise inter-rank synchronization, which makes end-to-end performance sensitive to workload imbalance. We present DWDP (Distributed Weight Data Parallelism), an inference parallelization strategy that preserves data-parallel execution while offloading MoE weights across peer GPUs and fetching missing experts on demand.
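The idea of keeping execution data-parallel while fetching missing experts from peers can be illustrated with a toy, CPU-only sketch. It is not the authors' implementation: the round-robin placement in `owner_of`, the `Rank` class, the string stand-ins for weights, and the absence of caching or overlap with compute are all assumptions made for illustration, since the abstract does not specify a placement policy or a transfer mechanism.

```python
from dataclasses import dataclass, field

NUM_RANKS = 4     # hypothetical number of GPUs (ranks)
NUM_EXPERTS = 8   # hypothetical number of MoE experts in one layer


def owner_of(expert_id: int, num_ranks: int = NUM_RANKS) -> int:
    """Assumed round-robin placement: expert e is stored on rank e % num_ranks."""
    return expert_id % num_ranks


@dataclass
class Rank:
    rank_id: int
    local_experts: dict = field(default_factory=dict)  # expert_id -> weights
    remote_fetches: int = 0

    def get_expert(self, expert_id: int, peers: list) -> str:
        """Return the expert's weights, reading from the owning peer if absent."""
        if expert_id in self.local_experts:
            return self.local_experts[expert_id]
        # Stand-in for a peer-to-peer read of the missing expert's weights.
        self.remote_fetches += 1
        owner = peers[owner_of(expert_id, len(peers))]
        return owner.local_experts[expert_id]


def build_ranks(num_ranks: int = NUM_RANKS, num_experts: int = NUM_EXPERTS) -> list:
    """Shard the expert weights across ranks using the assumed placement."""
    ranks = [Rank(r) for r in range(num_ranks)]
    for e in range(num_experts):
        ranks[owner_of(e, num_ranks)].local_experts[e] = f"weights[{e}]"
    return ranks


def run_layer(rank: Rank, routed_experts: list, peers: list) -> list:
    """Process one rank's own batch; no barrier with the other ranks is needed."""
    return [rank.get_expert(e, peers) for e in routed_experts]


ranks = build_ranks()
# Rank 0 owns experts 0 and 4; experts 1 and 5 live on rank 1.
used = run_layer(ranks[0], [0, 1, 4, 5], ranks)
```

In this sketch each rank works only on its own requests and pulls weights when it needs them, so a slow rank does not stall the others at a per-layer collective. A real system would move tensors over the GPU interconnect rather than read Python dictionaries.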