X-OPD: Cross-Modal On-Policy Distillation for Capability Alignment in Speech LLMs

arXiv:2603.24596v1 Announce Type: cross While the shift from cascaded dialogue systems to end-to-end (E2E) speech Large Language Models (LLMs) improves latency and paralinguistic modeling, E2E models often exhibit significant performance degradation compared to their text-based counterparts. The standard Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL)