Audiocite.net: A Large Spoken Read Dataset in French

Document record

Date

20 May 2024

Collection

Open archives

License

info:eu-repo/semantics/OpenAccess




Cite this document

Soline Felice et al., "Audiocite.net: A Large Spoken Read Dataset in French", HAL-SHS: sciences de l'information, de la communication et des bibliothèques, ID: 10670/1.5spjw3



Abstract

The advent of self-supervised learning (SSL) in speech processing has allowed the use of large unlabeled datasets to learn pre-trained models, serving as powerful encoders for various downstream tasks. However, the application of these SSL methods to languages such as French has proved difficult due to the scarcity of large French speech datasets. To advance the emergence of pre-trained models for French speech, we present the Audiocite.net corpus composed of 6,682 hours of recordings from 130 readers. This corpus is built from audiobooks from the audiocite.net website. In addition to describing the creation process and final statistics, we also show how this dataset impacted the models of the LeBenchmark project in its 14k version for speech processing downstream tasks.
