AI RESEARCH
Infini-News: Efficiently Queryable Access to 1.3 Billion Processed Common Crawl News Articles
arXiv CS.CL
arXiv:2605.18337v1
Large-scale news corpora support a wide range of research in Computational Social Science and NLP, yet access remains constrained: commercial archives impose prohibitive costs and licensing restrictions, while open alternatives such as Common Crawl's CC-News require terabyte-scale storage and computationally intensive processing. We present Infini-News, a retrieval toolkit and index for the entire CC-News archive, from August 2016 to the latest available snapshot. Our contributions are threefold.