I need help testing my llama.cpp DeepSeek Sparse Attention (DSA) implementation (looking for someone GPU-rich)

I have an initial proof-of-concept implementation ready, and now I want to confirm that it works correctly. Unfortunately, the difference in model performance between dense and sparse attention is subtle and shows up only on very complex problems. Basically, you need a full benchmark run to make sure the implementation works correctly. I can't do that on my Epyc 9374F + RTX PRO 6000 workstation, as it would take hundreds of hours.