Strix Halo + eGPU RTX 5070 Ti via OCuLink in llama.cpp: Benchmarks and Conclusions

Hey everyone. I have a Strix Halo mini PC (Minisforum MS-S1 Max). I added an RTX 5070 Ti eGPU to it via OCuLink, tested how the two work together in llama.cpp, and wanted to share my findings.

TL;DR:

Vulkan's versatility: It's a highly efficient API that lets you reliably combine chips from different vendors (like an AMD APU + NVIDIA GPU). The performance drop compared to native CUDA or ROCm is small, only about 5-10%.

The role of OCuLink: The bandwidth of this connection doesn't bottleneck token generation (tg) or prompt processing (pp.