We’ve resolved the data anonymization challenge, but data extraction is slow. What is your technology stack? [D]

I am currently building a RAG pipeline that needs to process a massive volume of messy legacy data, including outdated reports, poorly formatted emails, various PDFs, mobile photos, and more. While the retrieval and generation components are functioning smoothly, I’ve hit a major bottleneck during the data preparation phase, specifically regarding data anonymization and schema mapping.