AI RESEARCH

Smart Bilingual Focused Crawling of Parallel Documents

arXiv CS.LG

arXiv:2405.14779v2 Announce Type: replace-cross

Crawling parallel texts -- texts that are mutual translations -- from the Internet is usually done following a brute-force approach: documents are downloaded en masse in an unguided process, and only a fraction of them end up leading to actual parallel content. In this work we propose a smart crawling method that guides the crawl towards finding parallel content rapidly.
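
To make the contrast between an unguided and a guided crawl concrete, here is a minimal sketch in Python. It is not the method proposed in the paper: the toy link graph `LINKS`, the `score` heuristic (which simply favours URLs containing a language segment such as `en` or `es`), and the function names are all assumptions made for illustration. The brute-force crawl visits pages in discovery order, while the focused crawl keeps a priority queue and always visits the most promising known URL next.

```python
import heapq
from collections import deque

# Toy in-memory "web": URL -> outgoing links (hypothetical data, no network access).
LINKS = {
    "site/": ["site/about", "site/en/news", "site/blog"],
    "site/about": ["site/contact"],
    "site/blog": ["site/blog/1"],
    "site/en/news": ["site/es/news"],
    "site/es/news": [],
    "site/contact": [],
    "site/blog/1": [],
}


def score(url):
    """Hypothetical heuristic: a language segment in the path looks promising."""
    return 1.0 if any(seg in ("en", "es") for seg in url.split("/")) else 0.0


def brute_force_order(seed):
    """Breadth-first crawl: visit links in discovery order, unguided."""
    seen, queue, order = {seed}, deque([seed]), []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


def focused_order(seed):
    """Best-first crawl: always visit the highest-scoring known URL next."""
    seen, order = {seed}, []
    counter = 0  # tie-breaker: keeps discovery order among equal scores
    frontier = [(-score(seed), counter, seed)]  # negate score for a max-heap
    while frontier:
        _, _, url = heapq.heappop(frontier)
        order.append(url)
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                counter += 1
                heapq.heappush(frontier, (-score(link), counter, link))
    return order


if __name__ == "__main__":
    print("brute force:", brute_force_order("site/"))
    print("focused:    ", focused_order("site/"))
```

On this toy graph the focused crawl reaches the candidate pair `site/en/news` and `site/es/news` immediately after the seed, whereas the breadth-first crawl downloads several unrelated pages first. A real system would replace the hand-written heuristic with a learned model; the priority-queue structure is the only point being illustrated here.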