Tunisian Arabizi: Linguistic Analyses and Corpus Building using Natural Language Processing

Elisa Gugliotta, « Arabizi tunisien : Analyses linguistiques et création d'un corpus par le biais du traitement automatique des langues (TAL) », HAL-SHS : linguistique, ID : 10670/1.tvzr1g


Abstract

This work investigates Tunisian Arabic and responds to the lack of tools supporting research on it. In particular, our objective was to build a corpus suitable for various types of linguistic analysis, hence the many levels of annotation we have provided it with. We narrowed our focus to a specific variety of Tunisian Arabic, the one used for digital communication and identity sharing, which we define as Digital Networked Writing (DNW). Moreover, we collected texts encoded in the writing system typical of this context, namely Arabizi.

In Chapter 1, we introduced the multilingual complexity of Tunisian, providing the reader with some fundamental keys for grasping the possible implications of the use of this writing system. Among these key points, we considered the history of Tunisia, the traditional dialectological classification of Tunisian Arabic, and the historical and modern interpretations of the distribution of different linguistic entities across the territory. The topic of multilingualism, specifically, served as a gateway to subsequent questions about the emergence of one urban variety over the others. Here we also mentioned the diffusion of Tunisian in Computer-Mediated Communication (CMC), and in particular its dual mode of writing, digraphia. We finally presented the main characteristics of Tunisian Arabizi.

In Chapter 2, after a brief introduction, we outlined our methodology. First of all, we described the structural characteristics of written CMC and how it is addressed by linguistic research. We then detailed the good practices to be observed when building linguistic corpora. Finally, to speed up the process, to ensure the reproducibility of the methodology adopted, and to extend the usefulness of both the corpus and the procedure itself, we opted for deep learning techniques. We also provided background information for understanding the techniques employed to build our corpus. We closed the chapter with a review of the state of the art, sharing information on the various types of work carried out with a methodology similar to ours on Modern Standard Arabic (MSA) and dialectal Arabic. A second objective of this review was to highlight the lack of resources available to support research on Tunisian Arabic in this area (and on Maghrebi Arabic in general).

Chapter 3 dealt with the specific operations, described step by step, that led to the realisation of the corpus. The chapter traced the path from the beginning, starting with data collection and the decisions made in selecting and collecting the metadata of the texts. We also described the stages of semi-automatic annotation of the corpus in its annotation layers: word-level classification, transliteration into Arabic characters, tokenisation, Part-of-Speech tagging, and lemmatisation. We also discussed the experiments that led us to identify the best strategies for achieving our goals, moving from a first phase to a second. Finally, we described the results of the second phase. The first is the multi-task sequence prediction architecture, the tool built to produce, from the Arabizi text, the different annotation layers that make up the Tunisian Arabish Corpus (TArC). The other is the corpus itself, which is described along with the amount of data and metadata it includes.

Finally, in Chapter 4 we addressed the preliminary investigation. The analyses aimed at delineating the nature of the corpus itself and at applying computational-linguistic strategies to observe the linguistic reality of Tunisian Arabizi.

Each of these analyses is reproducible thanks to the release of the small scripts used to carry them out.
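To make the annotation layers listed for Chapter 3 more concrete, the following is a minimal sketch of how one word and its layers could be held in a single record. It is not the TArC file format or the author's tooling: the class name `AnnotatedToken`, its field names, the `arabizi` class label, the `PREP`/`PRON` tags, and the example word are all illustrative assumptions, and the sample word is not drawn from the corpus.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AnnotatedToken:
    """One Arabizi word with the annotation layers named in the abstract.

    Hypothetical structure for illustration only; field names, class
    labels and POS tags are assumptions, not the corpus's actual scheme.
    """
    arabizi: str          # surface form as typed by the user
    word_class: str       # word-level classification (assumed label set)
    transliteration: str  # rendering in Arabic characters
    segments: List[str]   # tokenisation of the transliterated form
    pos: List[str]        # one Part-of-Speech tag per segment (assumed tagset)
    lemma: str            # lemma of the word

    def __post_init__(self) -> None:
        # Tokenisation and POS tagging must stay aligned segment by segment.
        if len(self.segments) != len(self.pos):
            raise ValueError("each segment needs exactly one POS tag")

    def to_row(self) -> str:
        """Serialise the layers as one tab-separated line, one column per layer."""
        return "\t".join([
            self.arabizi,
            self.word_class,
            self.transliteration,
            "+".join(self.segments),
            "+".join(self.pos),
            self.lemma,
        ])


# Illustrative example (not taken from the corpus): "3andi", roughly "I have".
example = AnnotatedToken(
    arabizi="3andi",
    word_class="arabizi",
    transliteration="عندي",
    segments=["عند", "ي"],
    pos=["PREP", "PRON"],
    lemma="عند",
)
row = example.to_row()
```

Keeping every layer on the same record, with the segment and tag lists checked for equal length, reflects the idea that the layers are parallel annotations of the same Arabizi input; a real pipeline would likely validate labels against the corpus's own tagset.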
