Speech Recognition on TV Series with Video-guided Post-ASR Correction

ArXi:2506.07323v3 Announce Type: replace-cross Automatic Speech Recognition (ASR) has achieved remarkable success with deep learning, driving advancements in conversational artificial intelligence, media transcription, and assistive technologies. However, ASR systems still struggle in complex environments such as TV series, where multiple speakers, overlapping speech, domain-specific terminology, and long-range contextual dependencies pose significant challenges to transcription accuracy.