AI RESEARCH
HALvest-Contrastive: Retrieval-Like Authorship Attribution with Patch-Level Late Interaction
arXiv CS.CL
arXiv:2407.20595v4 Announce Type: replace-cross Deciding whether two pieces of text share an author is made difficult by topical confounds: two writers covering the same topic often look more alike than one writer covering two topics. We tackle this with HALvest, a 17-billion-token multilingual corpus of open-access scholarly papers, and its English contrastive derivative HALvest-Contrastive, in which same-author passages are drawn from distinct papers within a field to minimize topical overlap. We also revisit how documents are compared.
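
The pairing rule described in the abstract (same author, same field, different papers) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `sample_positive_pairs` and the record keys `author`, `paper_id`, `field`, and `text` are hypothetical, chosen only to make the constraint concrete.

```python
import random
from collections import defaultdict
from itertools import combinations


def sample_positive_pairs(passages, seed=0):
    """Pick one same-author pair per (author, field) group.

    `passages` is a list of dicts with the hypothetical keys
    "author", "paper_id", "field", and "text".
    """
    rng = random.Random(seed)

    # Group by (author, field) so both passages share an author and a field.
    groups = defaultdict(list)
    for passage in passages:
        groups[(passage["author"], passage["field"])].append(passage)

    pairs = []
    for group in groups.values():
        # Keep only pairs drawn from two different papers, which is the
        # constraint the abstract uses to limit topical overlap.
        candidates = [
            (a, b)
            for a, b in combinations(group, 2)
            if a["paper_id"] != b["paper_id"]
        ]
        if candidates:
            pairs.append(rng.choice(candidates))
    return pairs
```

Authors with passages from only one paper in a field yield no pair under this rule, so a real construction would presumably need to filter or handle such authors explicitly.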