Generative Synthetic Data for Causal Inference: Pitfalls, Remedies, and Opportunities

arXiv:2604.23904v1. Synthetic data offers a promising tool for privacy-preserving data release, augmentation, and simulation, but its use in causal inference requires preserving more than predictive fidelity. We show that fully generative tabular synthesizers, including GAN- and LLM-based models, can achieve strong train-on-synthetic, test-on-real performance while substantially distorting causal estimands such as the average treatment effect
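
To make the distinction between predictive fidelity and causal fidelity concrete, the following is a minimal sketch of how one might compare an average treatment effect (ATE) estimate on real data against the same estimate on synthetic data. It is not the paper's method: it assumes a binary, randomized treatment so that a simple difference in means is a valid ATE estimator, and the function names `diff_in_means_ate` and `ate_gap`, as well as the toy numbers, are hypothetical and chosen only for illustration.

```python
from statistics import mean


def diff_in_means_ate(treatment, outcome):
    """Difference-in-means ATE estimate.

    Assumption: treatment is binary (0/1) and randomized, so no
    covariate adjustment is needed. This is an illustrative estimator,
    not the one used in the paper.
    """
    treated = [y for t, y in zip(treatment, outcome) if t == 1]
    control = [y for t, y in zip(treatment, outcome) if t == 0]
    if not treated or not control:
        raise ValueError("need at least one treated and one control unit")
    return mean(treated) - mean(control)


def ate_gap(real, synthetic):
    """Absolute gap between the ATE estimated on real and on synthetic data.

    Each argument is a (treatment, outcome) pair of equal-length lists.
    """
    return abs(diff_in_means_ate(*real) - diff_in_means_ate(*synthetic))


# Toy, hand-made numbers for illustration only (not results from the paper).
real = ([1, 1, 0, 0], [5.0, 7.0, 3.0, 5.0])
synthetic = ([1, 1, 0, 0], [5.0, 6.0, 4.0, 5.0])

print(diff_in_means_ate(*real))       # ATE estimate on the real toy data
print(diff_in_means_ate(*synthetic))  # ATE estimate on the synthetic toy data
print(ate_gap(real, synthetic))       # how far the synthetic estimate drifts
```

A diagnostic like `ate_gap` complements a train-on-synthetic, test-on-real score: a synthesizer can reproduce the outcome distribution well enough for prediction while still shifting the treated-versus-control contrast that the ATE depends on.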