AI RESEARCH

ParlaSpeech 3.0: Richly Annotated Spoken Parliamentary Corpora of Croatian, Czech, Polish, and Serbian

arXiv cs.CL

arXiv:2511.01619v2 Announce Type: replace

ParlaSpeech is a collection of spoken parliamentary corpora currently spanning four Slavic languages (Croatian, Czech, Polish, and Serbian), totaling 6K hours. The corpora were built automatically from the ParlaMint transcripts and their corresponding metadata, which were aligned to the speech recordings of each parliament. In this release of the dataset, each corpus is significantly enriched with various automatic annotation layers.