What kinds of models are people training with document data? [P]

We've helped some folks generate synthetic data for a number of different projects, and some of those involved "document data": annotated PDFs and PNGs of things like tax forms and health forms, especially documents containing PII, which are hard to get because of obvious privacy concerns. So we came up with an engine that builds a simulation and then extracts the data from that simulation. We're trying to make sure our pipeline fits into a normal