Using n-akṣaras to model Sanskrit & Sanskrit-adjacent texts

Document record

Author: Charles Li
Date: 12 January 2023

Relations

This document is related to:
info:eu-repo/semantics/altIdentifier/arxiv/2301.12969

Collection

Open archives

Licenses

http://creativecommons.org/licenses/by-nc/ , info:eu-repo/semantics/OpenAccess



Related subjects

Pattern Model

Cite this document

Charles Li, "Using n-akṣaras to model Sanskrit & Sanskrit-adjacent texts", HAL-SHS: littérature, ID: 10670/1.lher44



Abstract

Despite — or perhaps because of — their simplicity, n-grams, or contiguous sequences of tokens, have been used with great success in computational linguistics since their introduction in the late 20th century. Recast as k-mers, or contiguous sequences of monomers, they have also found applications in computational biology. When applied to the analysis of texts, n-grams usually take the form of sequences of words. But if we try to apply this model to the analysis of Sanskrit texts, we are faced with the arduous task of, firstly, resolving sandhi to split a phrase into words, and, secondly, splitting long compounds into their components. This paper presents a simpler method of tokenizing a Sanskrit text for n-grams, by using n-akṣaras, or contiguous sequences of akṣaras. This model reduces the need for sandhi resolution, making it much easier to use on raw text. It is also possible to use this model on Sanskrit-adjacent texts, e.g., a Tamil commentary on a Sanskrit text. As a test case, the commentaries on Amarakoṣa 1.0.1 have been modelled as n-akṣaras, showing patterns of text reuse across ten centuries and nine languages. Some initial observations are made concerning Buddhist commentarial practices.
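The tokenization idea in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation, which the abstract does not describe. It assumes the input is Sanskrit in IAST transliteration, that "ai" and "au" are always diphthongs, and that an akṣara is a run of consonants followed by one vowel and any anusvāra or visarga. The function names `split_aksaras` and `n_aksaras` are hypothetical.

```python
import re
import unicodedata

# Assumption: IAST input. The vowel and modifier inventories below are
# illustrative choices, not taken from the paper.
_SIMPLE_VOWELS = unicodedata.normalize("NFC", "aāiīuūṛṝḷḹeo")
_MODIFIERS = unicodedata.normalize("NFC", "ṃḥ")  # anusvāra, visarga

# One akṣara: any run of non-vowel letters, then a vowel (diphthongs first),
# then any trailing modifiers.
_AKSARA = re.compile(
    rf"[^{_SIMPLE_VOWELS}]*(?:ai|au|[{_SIMPLE_VOWELS}])[{_MODIFIERS}]*"
)


def split_aksaras(text):
    """Split IAST text into akṣaras, ignoring spaces and punctuation."""
    normalized = unicodedata.normalize("NFC", text.lower())
    # Dropping non-letters removes word boundaries, so the split does not
    # depend on how (or whether) the words were separated.
    cleaned = "".join(ch for ch in normalized if ch.isalpha())
    aksaras = _AKSARA.findall(cleaned)
    # Consonants left over at the very end have no vowel of their own;
    # attach them to the last akṣara.
    tail = cleaned[sum(len(a) for a in aksaras):]
    if tail:
        if aksaras:
            aksaras[-1] += tail
        else:
            aksaras.append(tail)
    return aksaras


def n_aksaras(text, n):
    """Return every contiguous sequence of n akṣaras as a tuple."""
    aksaras = split_aksaras(text)
    return [tuple(aksaras[i:i + n]) for i in range(len(aksaras) - n + 1)]


print(split_aksaras("tad eva"))  # ['ta', 'de', 'va']
print(n_aksaras("dharmakṣetre kurukṣetre", 3))
```

Because spaces are discarded before splitting, "tad eva" and "tadeva" yield the same akṣaras, which is the sense in which a model like this avoids word segmentation. Counting shared tuples between two texts (for example with `collections.Counter`) would then give a rough measure of text reuse.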
