Dual GPU llama.cpp speedup

llama.cpp has had a long-standing issue with "--split-mode tensor": you'll get great results, but it only supports non-quantized KV caches. For this very reason, a lot of people decide to go with a healthy-sized KV cache and ignore tensor parallelism. I've had a stab at fixing the issue here - - it's branched from mainline as of today, with minimal changes. I'm personally running a 3060 12GB + 4070 Super 12GB, for a combined 24GB.
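
To put rough numbers on why people are reluctant to give up a quantized KV cache, here is a minimal back-of-the-envelope sketch. The model dimensions (32 layers, 8 KV heads, head dim 128, 32k context) are hypothetical and only for illustration; they are not taken from my setup or from any particular model. The bytes-per-element figures follow the ggml block layouts for f16, q8_0 and q4_0.

```python
# Approximate storage cost per cached element for three ggml cache types:
# f16 is 2 bytes per value; q8_0 packs 32 values into 34 bytes;
# q4_0 packs 32 values into 18 bytes.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type):
    """Estimate total KV cache size in bytes (K and V tensors together)."""
    # One K and one V vector per layer, per KV head, per token in the context.
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elements * BYTES_PER_ELEMENT[cache_type]


if __name__ == "__main__":
    # Hypothetical model shape, chosen only to illustrate the arithmetic.
    for cache_type in BYTES_PER_ELEMENT:
        size = kv_cache_bytes(32, 8, 128, 32768, cache_type)
        print(f"{cache_type}: {size / 2**30:.2f} GiB")
```

With those assumed dimensions the f16 cache comes to 4 GiB, against a little over 2 GiB for q8_0 and a little over 1 GiB for q4_0. On a combined 24GB that gap is a meaningful share of what is left after the weights.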