Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs

arXiv:2512.16378v3 Announce Type: replace-cross

As Large Language Models (LLMs) expand beyond text, integrating speech as a native modality has given rise to SpeechLLMs, which process spoken language directly and enable speech-to-text translation (ST) and other downstream tasks without a traditional transcription-based pipeline. Whether this integration improves ST quality over established cascaded architectures, however, remains an open question.