How well do current models handle Icelandic audio?

I’ve been doing some informal testing on how current multimodal models handle speech + multilingual understanding, and came across an interesting behavior that feels slightly beyond standard translation. I used a short audio clip in a language I don’t understand (likely Icelandic) and evaluated the output along a few dimensions:1. Transcription qualityThe model produced a relatively clean transcript, with no obvious structural breakdown.2. Translation fidelity vs.