Extract Clean Text from Any Webpage for RAG Pipelines
Dev.to AI • Generative AI
Building a RAG (Retrieval-Augmented Generation) system? You need clean text, not raw HTML. Here's a simple approach using the Cheerio $ handle that CheerioCrawler passes to your request handler:

    // Remove noise
    $("script, style, nav, footer, header, aside, .ad, noscript").remove();

    // Get main content; fall back to the whole body if too little was found
    let text = $("article, [role=main], main, .content").first().text();
    if (text.length < 100) text = $("body").text();

    // Collapse runs of whitespace into single spaces
    text = text.replace(/\s+/g, " ").trim();

Why Not Just Use body.text()?

Raw body text includes navigation menus, footer links, cookie banners, and ad text. For RAG, you want only the main content, since any boilerplate that gets indexed can be retrieved in place of real content.
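The fallback and whitespace steps from the snippet can be pulled into a small pure function, which makes them easy to test without running a crawler. This is a minimal sketch, not the author's implementation: the names normalizeWhitespace and pickMainText are hypothetical, and plain strings stand in for the results of $(...).text(). The 100-character threshold is taken from the snippet. One small departure is that this version measures length after collapsing whitespace, so indentation and blank lines don't count toward the threshold.

```javascript
// Hypothetical helpers for illustration; not part of Cheerio or Crawlee.

// Collapse every run of whitespace (spaces, tabs, newlines) into one space.
function normalizeWhitespace(text) {
  return text.replace(/\s+/g, " ").trim();
}

// Prefer the main-content text; fall back to the full body text when the
// main-content candidate is too short to be a real article.
function pickMainText(mainText, bodyText, minLength = 100) {
  const main = normalizeWhitespace(mainText ?? "");
  if (main.length >= minLength) return main;
  return normalizeWhitespace(bodyText ?? "");
}

// Plain strings standing in for $(...).text() results.
const article =
  "RAG systems retrieve passages and feed them to a model.\n\n".repeat(3);
const body = "Home  About  Contact " + article + " Cookie settings";

console.log(pickMainText(article, body));     // long enough: article text only
console.log(pickMainText("Too short", body)); // too short: falls back to body
```

Inside a request handler, after the noise elements have been removed, this could be called as pickMainText($("article, [role=main], main, .content").first().text(), $("body").text()).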