Languages(s) of the SHUN-PAO, a Computational Linguistics account

Document record

Date

3 December 2019

Relations

This document is related to:
info:eu-repo/grantAgreement//788476/EU/Elites, networks, and power in modern urban China (1830-1949)/ENPMUC

Collection

Open archives

Licence

info:eu-repo/semantics/OpenAccess




Cite this document

Pierre Magistry, « Languages(s) of the SHUN-PAO, a Computational Linguistics account », HAL-SHS : linguistique, ID : 10670/1.cqlvhs



Abstract

This work is part of a broader project that requires adapting information extraction (IE) methods to written materials, mostly press articles, published in China between the mid-19th and mid-20th centuries. This calls for a better understanding and description of the language(s) we can observe in our sources. More importantly, it is an unprecedented opportunity to provide a usage-based description of written languages as used in the press in Modern China. There is an abundant literature describing this pivotal era from different language-related perspectives and disciplines, including the history of language policies (Kaske, 2008), sociolinguistics (Weng, 2018), and historical linguistics (Coblin, 2000; Simmons, 2017). However, what is presented in this article is, as far as I know, the first usage-based study to leverage a complete corpus of almost 80 years of a daily newspaper, the Shen-Pao (申報), containing about 750 million sinograms, to account for actual practices and their evolution through time. To do so, I propose new Computational Linguistics methods and tools inspired by recent work in the field, especially Language Modeling and Contextual String Embeddings.
