Qwen3.5-Omni: Scaling Up, Toward Native Omni-Modal AGI (94 minute read)

Qwen3.5-Omni is a fully omnimodal large language model that understands text, images, audio, and audio-visual content. It can process more than 10 hours of audio input and over 400 seconds of 720P audio-visual input at 1 FPS. The model is trained on a massive amount of text and visual data, plus more than 100M hours of audio-visual data. It supports speech recognition in 113 languages and dialects and speech generation in 36 languages and dialects.