Audiocite.net: A Large Spoken Read Dataset in French

Soline Felice et al., "Audiocite.net: A Large Spoken Read Dataset in French", HAL-SHS: information, communication and library sciences, ID: 10670/1.5spjw3

Abstract

The advent of self-supervised learning (SSL) in speech processing has allowed the use of large unlabeled datasets to learn pre-trained models, serving as powerful encoders for various downstream tasks. However, the application of these SSL methods to languages such as French has proved difficult due to the scarcity of large French speech datasets. To advance the emergence of pre-trained models for French speech, we present the Audiocite.net corpus, composed of 6,682 hours of recordings from 130 readers. This corpus is built from audiobooks from the audiocite.net website. In addition to describing the creation process and final statistics, we also show how this dataset impacted the models of the LeBenchmark project in its 14k version for speech processing downstream tasks.
