Contrastive Entity Coreference and Disambiguation for Historical Texts

Abhishek Arora; Emily Silcock; Leander Heldring; Melissa Dell

Document record

Authors

Date

21 June 2024

Discipline

Document type

Printed texts

Scope

Publications

Identifier

2406.15576

Source

arXiv - Economics

Collection

arXiv

Organization

Cornell University

Keywords

Computer Science - Computation and Language; Economics - General Economics

Related subjects

Competence Documents Documents, Legal Documents Public documents Official publications Government documents Economic theory Political economy Documents Documents Indentures Documents Charters--Law and legislation Documents Manuscript repositories Manuscript depositories Manuscripts--Repositories Manuscripts--Depositories

Cite this document

Abhishek Arora et al., "Contrastive Entity Coreference and Disambiguation for Historical Texts", arXiv - Economics

Abstract

Massive-scale historical document collections are crucial for social science research. Despite increasing digitization, these documents typically lack unique cross-document identifiers for the individuals mentioned in them, as well as identifiers linking those individuals to external knowledgebases such as Wikipedia/Wikidata. Existing entity disambiguation methods often fall short in accuracy on historical documents, which are replete with individuals not remembered in contemporary knowledgebases. This study makes three key contributions to improve cross-document coreference resolution and disambiguation in historical texts: a massive-scale training dataset replete with hard negatives, which sources over 190 million entity pairs from Wikipedia contexts and disambiguation pages; high-quality evaluation data from hand-labeled historical newswire articles; and trained models evaluated on this historical benchmark. We contrastively train bi-encoder models for coreferencing and disambiguating individuals in historical texts, achieving accurate, scalable performance that identifies out-of-knowledgebase individuals. Our approach significantly surpasses other entity disambiguation models on our historical newswire benchmark. Our models also demonstrate competitive performance on modern entity disambiguation benchmarks, particularly certain news disambiguation datasets.
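
The abstract describes contrastive training of a bi-encoder and the detection of out-of-knowledgebase individuals. The general idea can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes a generic InfoNCE-style contrastive loss (the abstract does not name the loss used), hand-written toy vectors in place of learned embeddings, and a hypothetical similarity threshold of 0.8 below which a mention is treated as out-of-knowledgebase. The names `info_nce_loss`, `disambiguate`, `entity_a` and `entity_b` are illustrative only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss (an assumed, generic contrastive objective).

    The loss is low when the anchor is much closer to the positive
    than to every negative, and high when a negative is nearly as close.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    # Numerically stable log-sum-exp over all candidates.
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_sum_exp - logits[0]


def disambiguate(mention_vec, kb, threshold=0.8):
    """Return the nearest knowledgebase entity id, or None if no entity
    is similar enough (treated here as out-of-knowledgebase).

    The threshold value is a hypothetical choice for illustration.
    """
    best_id, best_sim = None, -1.0
    for entity_id, entity_vec in kb.items():
        sim = cosine(mention_vec, entity_vec)
        if sim > best_sim:
            best_id, best_sim = entity_id, sim
    return best_id if best_sim >= threshold else None


# Toy knowledgebase: in practice these would be encoder outputs.
kb = {"entity_a": [1.0, 0.0, 0.0], "entity_b": [0.0, 1.0, 0.0]}

result_in = disambiguate([0.9, 0.1, 0.0], kb)   # close to entity_a
result_out = disambiguate([0.0, 0.0, 1.0], kb)  # unlike any entry

# A hard negative (almost identical to the anchor) yields a larger loss
# than an easy negative (orthogonal to the anchor).
loss_easy = info_nce_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
loss_hard = info_nce_loss([1.0, 0.0], [1.0, 0.0], [[0.9, 0.1]])
```

In this sketch the hard negative contributes far more to the loss than the easy one, which is one reason a training set rich in hard negatives, such as pairs drawn from disambiguation pages, can be useful for contrastive training.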
