Discriminating between Similar Languages using Weighted Subword Features

Fiche du document

Date

3 avril 2017

Discipline
Type de document
Périmètre
Langue
Identifiants
Relations

Ce document est lié à :
info:eu-repo/semantics/altIdentifier/doi/10.18653/v1/W17-1223

Collection

Archives ouvertes

Licence

info:eu-repo/semantics/OpenAccess




Citer ce document

Adrien Barbaresi, « Discriminating between Similar Languages using Weighted Subword Features », HAL-SHS : linguistique, ID : 10.18653/v1/W17-1223


Métriques


Partage / Export

Résumé En

The present contribution revolves around a contrastive subword n-gram model which has been tested in the Discriminating between Similar Languages shared task. I present and discuss the method used in this 14-way language identification task comprising varieties of 6 main language groups. It features the following characteristics: (1) the preprocessing and conversion of a collection of documents to sparse features; (2) weighted character n-gram profiles; (3) a multinomial Bayesian classifier. Meaningful bag-of-n-grams features can be used as a system in a straightforward way, my approach outperforms most of the systems used in the DSL shared task (3rd rank).

document thumbnail

Par les mêmes auteurs

Sur les mêmes sujets

Sur les mêmes disciplines

Exporter en