Uncovering Machine Translationese Using Corpus Analysis Techniques to Distinguish between Original and Machine-Translated French

Document record

Date

2021

Collection

Open archives

Licence

info:eu-repo/semantics/OpenAccess



Related subjects (En)

Frenchmen (French people)

Cite this document

Orphée de Clercq et al., "Uncovering Machine Translationese Using Corpus Analysis Techniques to Distinguish between Original and Machine-Translated French", HAL-SHS : linguistique, ID: 10670/1.2v7orh



Abstract (En)

This paper investigates the linguistic characteristics of English-to-French machine-translated texts in comparison with original, untranslated French texts in order to uncover what has been called "machine translationese". In the same vein as corpus-based translation studies, which have focused on human-translated texts, and using a corpus-based statistical approach (Principal Component Analysis), we analyzed a ca. 1.8-million-word corpus of English-to-French translations of press texts, corresponding to the output of four machine translation systems: one statistical (SMT) and three neural (NMT) systems, namely DeepL, Google Translate, and the European Commission's eTranslation MT tool, in both its SMT and NMT versions. In particular, to complement a previous study on language-specific features in French (e.g. derived adverbs, existential constructions, coordinator et, preposition avec), a series of language-independent linguistic features were extracted for each text in our corpus, ranging from superficial text characteristics such as average word and sentence length to frequencies of closed-class lexical categories and measures of lexical diversity. Our results, which compare the machine-translated data with a corpus of untranslated French data, allow us to uncover linguistic features in French machine-translated texts that clearly deviate from the observed norms in original French (e.g. average sentence length, n-gram features, lexical diversity), and which might inform the post-editing process in order to optimize translation quality.
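As a rough illustration of the kind of language-independent surface features the abstract mentions (average word length, average sentence length, lexical diversity), the following minimal sketch computes three of them for a single text. It is not the authors' pipeline: the function name `surface_features`, the regex-based tokenization and sentence splitting, and the use of a plain type-token ratio as the diversity measure are all assumptions made for illustration. The Principal Component Analysis step would then operate on a matrix of such per-text feature vectors and is not shown here.

```python
import re


def surface_features(text: str) -> dict[str, float]:
    """Compute three surface features for one text (illustrative only)."""
    # Naive sentence split on ., ! or ? followed by whitespace. This is an
    # assumption; a real study would use a proper French sentence segmenter.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # A word is a run of word characters, optionally joined by an apostrophe
    # or hyphen, so "l'homme" counts as one token (also an assumption).
    words = re.findall(r"\w+(?:['’-]\w+)*", text.lower())
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    return {
        # Mean number of characters per token.
        "avg_word_length": sum(len(w) for w in words) / len(words),
        # Mean number of tokens per sentence.
        "avg_sentence_length": len(words) / len(sentences),
        # Type-token ratio: distinct tokens divided by all tokens.
        "type_token_ratio": len(set(words)) / len(words),
    }


sample = "Le chat dort. Le chien dort aussi."
features = surface_features(sample)
```

The type-token ratio depends on text length, so comparing texts of very different sizes would likely call for a length-normalized diversity measure instead.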
