End-to-end Contrastive Language-Speech Pretraining Model For Long-form Spoken Question Answering

arXiv:2511.09282v3 Announce Type: replace-cross Significant progress has been made in spoken question answering (SQA) in recent years. However, many existing methods, including large audio language models, struggle to process long audio. Following the success of retrieval-augmented generation, a speech-related retriever shows promise in helping to preprocess long-form speech. However, the performance of existing speech-related retrievers is lacking.