CUDA + ROCm simultaneously with -DGGML_BACKEND_DL=ON!

I invested quite a bit of time and it wasn't easy, but I can finally run models like Minimax 2.7 Q4 using CUDA and ROCm at the same time, bypassing Vulkan:

load_tensors: offloaded 63/63 layers to GPU
load_tensors: CUDA0 model buffer size = 83650.42 MiB
load_tensors: CUDA_Host model buffer size = 622.76 MiB
load_tensors: ROCm0 model buffer size = 40314.35 MiB

The main advantage is the prefill.
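
For anyone who wants to see how the weights ended up split between the two backends, here is a rough Python sketch that adds up the buffer sizes from the load_tensors lines above. The function names (parse_buffer_sizes, report) are my own, and the regex only assumes the log format shown in this post, so treat it as an illustration and not as anything that ships with llama.cpp.

```python
import re

# Log text copied from the post; the regex below assumes exactly this format.
LOG = (
    "load_tensors: offloaded 63/63 layers to GPU "
    "load_tensors: CUDA0 model buffer size = 83650.42 MiB "
    "load_tensors: CUDA_Host model buffer size = 622.76 MiB "
    "load_tensors: ROCm0 model buffer size = 40314.35 MiB"
)

# Captures the buffer name (e.g. CUDA0) and its size in MiB.
BUFFER_RE = re.compile(r"load_tensors:\s+(\S+) model buffer size =\s+([\d.]+) MiB")


def parse_buffer_sizes(log_text):
    """Return {buffer_name: size_in_MiB} for every model buffer line found."""
    return {name: float(size) for name, size in BUFFER_RE.findall(log_text)}


def report(sizes):
    """Build one summary line per buffer, plus a total line."""
    total = sum(sizes.values())
    lines = []
    for name, size in sizes.items():
        share = 100.0 * size / total
        lines.append(f"{name:>10}: {size:10.2f} MiB ({share:.1f}% of model buffers)")
    lines.append(f"{'total':>10}: {total:10.2f} MiB")
    return lines


if __name__ == "__main__":
    for line in report(parse_buffer_sizes(LOG)):
        print(line)
```

It works on either the flattened or the line-per-entry form of the log, since the regex matches each load_tensors entry independently. Roughly two thirds of the weights sit on the CUDA device and one third on the ROCm device.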