Detecting Sexual Content at the Sentence Level in First Millennium Latin Texts

Thibault Clérice

Document record

Author

Thibault Clérice

Date

20 May 2024

Document type

Conferences and symposia

Scope

Publications

Language

English

Identifiers

handle: 10670/1.7kb7sj
hal: hal-04214375
arxiv: 2309.14974

Source

HAL-SHS : linguistique

Relations

This document is related to:
info:eu-repo/semantics/altIdentifier/arxiv/2309.14974

Collection

Open archives

Organization

Centre pour la communication scientifique directe

Licenses

http://creativecommons.org/licenses/by/, info:eu-repo/semantics/OpenAccess

Keywords (English)

Latin; Sentence classification; Figurative speech; Sexuality; Low-resource

Related subjects (English)

Knowledge, Classification of

Cite this document

Thibault Clérice, "Detecting Sexual Content at the Sentence Level in First Millennium Latin Texts", HAL-SHS : linguistique, ID: 10670/1.7kb7sj

Abstract (English)

In this study, we evaluate deep learning methods for sentence-level semantic classification as a way to accelerate corpus building in the humanities and linguistics, a traditionally time-consuming task. We introduce a novel corpus of around 2500 sentences with sexual semantics (medical, erotica, etc.), spanning from 300 BCE to 900 CE. We evaluate various sentence classification approaches and different input embedding layers, and show that all consistently outperform simple token-based searches. We explore the integration of idiolectal and sociolectal metadata embeddings (century, author, type of writing), but find that it leads to overfitting. Our results demonstrate the effectiveness of this approach, with HAN achieving a precision of 70.60% and a true positive rate (TPR) of 86.33%. We evaluate the impact of dataset size on model performance (420 instead of 2013), and show that, while our models perform worse, they still offer sufficiently high precision and TPR, even without MLM: 69% and 51% respectively. Given these results, we provide an analysis of the attention mechanism as a supporting added value for humanists in order to produce more data.
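For readers less familiar with the two metrics reported in the abstract, the following is a minimal sketch of how precision and true positive rate (TPR) are computed from binary sentence labels. The function name `precision_and_tpr` and the toy labels are assumptions made for illustration; they are not taken from the paper, its corpus, or its code.

```python
from typing import Sequence, Tuple


def precision_and_tpr(gold: Sequence[int], predicted: Sequence[int]) -> Tuple[float, float]:
    """Return (precision, TPR) for binary labels, where 1 marks a positive sentence."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    # Count true positives, false positives and false negatives.
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    # Precision: share of sentences flagged as positive that really are positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # TPR (recall): share of truly positive sentences that were flagged.
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, tpr


# Hypothetical toy labels, not drawn from the paper's data.
gold = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 1, 0, 0, 1, 0]
precision, tpr = precision_and_tpr(gold, predicted)
print(f"precision={precision:.2%} TPR={tpr:.2%}")
```

In a corpus-building setting, a high TPR means few relevant sentences are missed, while precision indicates how much manual filtering the humanist still has to do on the retrieved sentences.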
