20 May 2024
info:eu-repo/semantics/OpenAccess
Soline Felice et al., « Audiocite.net: A Large Spoken Read Dataset in French », HAL-SHS : sciences de l'information, de la communication et des bibliothèques, ID : 10670/1.5spjw3
The advent of self-supervised learning (SSL) in speech processing has made it possible to use large unlabeled datasets to learn pre-trained models, which serve as powerful encoders for various downstream tasks. However, applying these SSL methods to languages such as French has proved difficult due to the scarcity of large French speech datasets. To advance the emergence of pre-trained models for French speech, we present the Audiocite.net corpus, composed of 6,682 hours of recordings from 130 readers. This corpus is built from audiobooks from the audiocite.net website. In addition to describing the creation process and final statistics, we also show how this dataset impacted the models of the LeBenchmark project in its 14k version for speech processing downstream tasks.