AI RESEARCH

20M+ Indian legal documents with citation graphs and vector embeddings – potential uses for legal NLP? [D]

r/MachineLearning

Been working on structuring India's legal corpus for the past 2 years and wanted to share what I've built and hear from people working on legal NLP or low-resource Indian language models. dataset is 20M+ Indian court cases from the Supreme Court, all 25 High Courts, and 14 Tribunals. each case has structured metadata (court, bench, date, parties, judges, sections cited, acts referenced, case type). there's a citation graph across the full corpus where I've classified relationships as followed, distinguished, overruled, or mentioned.