MultiGraSCCo: A Multilingual Anonymization Benchmark with Annotations of Personal Identifiers

arXiv:2603.08879v1 Announce Type: new Accessing sensitive patient data for machine learning is challenging due to privacy concerns. Datasets with annotations of personally identifiable information are crucial for developing and testing anonymization systems to enable safe data sharing that complies with privacy regulations. Since accessing real patient data is a bottleneck, synthetic data offers an efficient solution to data scarcity, bypassing privacy regulations that apply to real data.