AI RESEARCH

C-Mining: Unsupervised Discovery of Seeds for Cultural Data Synthesis via Geometric Misalignment

arXiv CS.CL

ArXi:2604.15675v1 Announce Type: new Achieving cultural alignment in Large Language Models (LLMs) increasingly depends on synthetic data generation. For such synthesis, the most vital initial step is seed curation; however, current methods lack quantifiable standards for selecting these seeds. Existing approaches rely on unscalable manual curation or bias-prone LLM extraction, treating cultural specificity as an abstract concept rather than a measurable signal.