Towards the Automatic Processing of Language Registers: Semi-supervisedly Built Corpus and Classifier for French

Document record

Date

7 April 2019

Relations

This document is related to:
info:eu-repo/semantics/altIdentifier/doi/10.1007/978-3-031-24337-0_34

Collection

Open archives

License

info:eu-repo/semantics/OpenAccess


Related subjects (English)

Registers, lists, etc

Cite this document

Gwénolé Lecorvé et al., "Towards the Automatic Processing of Language Registers: Semi-supervisedly Built Corpus and Classifier for French", HAL-SHS: linguistique, ID: 10.1007/978-3-031-24337-0_34


Abstract

Language registers are a strongly perceptible characteristic of text and speech, yet they remain poorly studied in natural language processing. In this paper, we present a semi-supervised approach that jointly builds a corpus of texts labeled by register and an associated classifier. The approach relies on a small initial seed of expert data. After retrieving web pages on a massive scale, it iteratively alternates between training an intermediate classifier and annotating new texts to augment the labeled corpus. The approach is applied to the casual, neutral, and formal registers, yielding a 750M-word corpus and a final neural classifier with acceptable performance.
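
The iterative loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a toy word-count naive Bayes model stands in for the neural classifier, and the confidence threshold, the stopping rule, and the names train, predict, and self_train are all hypothetical. The abstract does not say how newly annotated texts are selected.

```python
import math
from collections import Counter

REGISTERS = ("casual", "neutral", "formal")


def train(labeled):
    """Fit a toy naive Bayes model (stand-in for the paper's neural classifier)."""
    counts = {reg: Counter() for reg in REGISTERS}
    for text, reg in labeled:
        counts[reg].update(text.lower().split())
    vocab = {word for counter in counts.values() for word in counter}
    return counts, vocab


def predict(model, text):
    """Return (most likely register, its probability) for one text."""
    counts, vocab = model
    scores = {}
    for reg in REGISTERS:
        total = sum(counts[reg].values())
        # Laplace-smoothed log-likelihood with a uniform prior over registers;
        # words never seen in training are ignored.
        scores[reg] = sum(
            math.log((counts[reg][word] + 1) / (total + len(vocab)))
            for word in text.lower().split()
            if word in vocab
        )
    # Softmax turns the log scores into probabilities.
    top = max(scores.values())
    exp = {reg: math.exp(score - top) for reg, score in scores.items()}
    best = max(exp, key=exp.get)
    return best, exp[best] / sum(exp.values())


def self_train(seed, pool, threshold=0.6, max_rounds=5):
    """Alternate between training and annotating, growing the labeled corpus.

    `threshold` and `max_rounds` are assumptions made for this sketch.
    """
    labeled, pool = list(seed), list(pool)
    for _ in range(max_rounds):
        model = train(labeled)  # intermediate classifier
        confident, remaining = [], []
        for text in pool:
            reg, prob = predict(model, text)
            if prob >= threshold:
                confident.append((text, reg))
            else:
                remaining.append(text)
        if not confident:  # nothing new to annotate: stop early
            break
        labeled.extend(confident)
        pool = remaining
    return labeled, train(labeled), pool
```

Called with a handful of seed (text, register) pairs and a list of unlabeled texts, self_train returns the augmented corpus, a classifier trained on it, and the texts that were never labeled with enough confidence.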
