Built a Japanese ASR benchmark because existing ones can't measure quality differences properly

Was fine-tuning a Japanese ASR model (based on Qwen3-ASR) to handle technical terminology better. The model clearly improved: "Next.js" comes out as "Next.js" instead of "ネクストジェイズ", punctuation works, etc. But existing Japanese benchmarks scored it almost the same as the base model. Turns out Japanese ASR benchmarks have a structural problem: Japanese is written in 4 scripts (hiragana, katakana, kanji, Latin), so the same word can have multiple valid spellings. Benchmarks either penalize valid alternatives or normalize everything away (losing real quality signals).
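To make the two failure modes concrete, here's a minimal sketch using plain character error rate (CER). The sentences, the `cer` helper, and the `strip_punct` normalizer are all made up for illustration; they are not taken from any particular benchmark and are not the scoring used in mine.

```python
import unicodedata

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance over characters (insert, delete, substitute = 1 each)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)

def strip_punct(text: str) -> str:
    """Hypothetical aggressive normalizer: drop all punctuation and whitespace."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P") and not ch.isspace()
    )

# Failure mode 1: a valid alternative spelling gets penalized.
# "猫" (kanji) and "ねこ" (hiragana) are the same word.
print(cer("猫がいる", "ねこがいる"))  # 0.5

# Failure mode 2: normalization hides a real quality difference.
ref = "はい、そうです。"
hyp = "はいそうです"  # model output with no punctuation at all
print(cer(ref, hyp))                            # 0.25
print(cer(strip_punct(ref), strip_punct(hyp)))  # 0.0
```

In the first case a correct transcription scores 50% CER just for picking a different script. In the second, stripping punctuation makes a model that never punctuates look identical to one that does.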