Point Break: Surfing Heterogeneous Data for Subtitle Segmentation

Document record

Date

3 September 2021

Collection

OpenEdition Books

Organisation

OpenEdition

Licences

https://www.openedition.org/12554, info:eu-repo/semantics/openAccess


Abstract

To achieve their purpose of transmitting information, subtitles need to be easily readable. The segmentation of subtitles into phrases or linguistic units is key to their readability and comprehension. However, automatically segmenting a sentence into subtitles is a challenging task, and data containing reliable human segmentation decisions are often scarce. In this paper, we leverage data with noisy segmentation from large subtitle corpora and combine them with smaller amounts of high-quality data to train models that automatically segment a sentence into subtitles. We show that even a minimal amount of reliable data can lead to readable subtitles and that quality is more important than quantity for the task of subtitle segmentation.
