kernel-anvil: 2x decode speedup on AMD by auto-tuning llama.cpp kernels per model shape

Built a tool that profiles your GGUF model's layer shapes on your AMD GPU and generates optimal kernel configs that llama.cpp loads at runtime. No recompilation needed.

The problem: llama.cpp's MMVQ kernels use the same thread/block configuration for every layer regardless of shape. A 1024-row GQA projection gets the same settings as a 17408-row FFN layer. This leaves significant performance on the table, especially on RDNA3.

The fix: kernel-anvil reads your GGUF, identifies the unique GEMV shapes, profiles each one on your actual GPU, and writes a JSON config file.
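To make the flow concrete, here is a rough Python sketch of the "dedupe shapes, profile each, write JSON" loop described above. It is not kernel-anvil's actual code: the candidate settings (`nwarps`, `rows_per_block`), the JSON key format, the 5120-column width, and the `fake_timer` cost model are all assumptions made up for illustration. A real run would replace `fake_timer` with a function that launches the kernel on the GPU and measures it.

```python
import json

# Hypothetical search space; the post does not say which knobs the tool tunes.
CANDIDATES = [
    {"nwarps": 1, "rows_per_block": 1},
    {"nwarps": 2, "rows_per_block": 1},
    {"nwarps": 4, "rows_per_block": 2},
    {"nwarps": 8, "rows_per_block": 4},
]

# Illustrative tensor list: (name, quant type, rows, cols).
# In practice this would come from parsing the GGUF file.
TENSORS = [
    ("blk.0.attn_k.weight", "Q4_K", 1024, 5120),
    ("blk.1.attn_k.weight", "Q4_K", 1024, 5120),
    ("blk.0.ffn_up.weight", "Q4_K", 17408, 5120),
    ("blk.1.ffn_up.weight", "Q4_K", 17408, 5120),
]


def unique_shapes(tensors):
    """Collapse per-layer tensors into the distinct (quant, rows, cols) shapes."""
    return sorted({(qtype, rows, cols) for _name, qtype, rows, cols in tensors})


def tune(tensors, time_kernel):
    """Pick the fastest candidate per unique shape using the supplied timer."""
    config = {}
    for qtype, rows, cols in unique_shapes(tensors):
        best = min(CANDIDATES, key=lambda c: time_kernel(qtype, rows, cols, c))
        config[f"{qtype}:{rows}x{cols}"] = best  # hypothetical key format
    return config


def fake_timer(qtype, rows, cols, cand):
    """Stand-in cost model (NOT a real measurement): small layers favor few warps."""
    return abs(rows / 4096 - cand["nwarps"])


if __name__ == "__main__":
    print(json.dumps(tune(TENSORS, fake_timer), indent=2))
```

The point of the sketch is the structure: four tensors collapse to two unique shapes, so only two profiling runs are needed, and each shape can end up with different settings instead of one global default.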