Logical Layout Analysis Applied to Historical Newspapers

Document record


19 December 2021

Document type

Open archives


http://creativecommons.org/licenses/by/ , info:eu-repo/semantics/OpenAccess

Cite this document

Nicolas Gutehrlé et al., "Logical Layout Analysis Applied to Historical Newspapers", HAL-SHS : linguistique, ID: 10670/1.wsa31l



Abstract (English)

In recent years, libraries and archives have led major digitisation campaigns that opened access to vast collections of historical documents. While such documents are often available as XML ALTO documents, they lack information about their logical structure. In this paper, we address the problem of logical layout analysis applied to historical documents. We propose a method based on the study of a dataset in order to identify rules that assign logical labels to both blocks and lines of text from XML ALTO documents. Our dataset contains newspapers in French, published in the first half of the 20th century. The evaluation shows that our methodology performs well for the identification of first lines of paragraphs and text lines, with F1 above 0.9. The identification of titles obtains an F1 of 0.64. This method can be applied to preprocess XML ALTO documents in preparation for downstream tasks, and also to annotate large-scale datasets to train machine learning and deep learning algorithms.
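
To make the idea of rule-based labelling more concrete, here is a minimal Python sketch of how rules might assign logical labels to text lines in an XML ALTO file. The abstract does not state the paper's actual rules, so everything specific here is an assumption made for illustration: the two rules (relative line height for titles, indentation for first lines of paragraphs), the thresholds `indent_min` and `title_ratio`, the label names, the function name `label_lines`, and the sample page. Only the ALTO element and attribute names (`TextBlock`, `TextLine`, `String`, `HPOS`, `HEIGHT`, `CONTENT`) come from the ALTO format itself.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip any XML namespace so the sketch works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def label_lines(alto_xml, indent_min=15.0, title_ratio=1.5):
    """Return (text, label) pairs for every TextLine in an ALTO string.

    Hypothetical rules, NOT the ones from the paper:
      - a line much taller than the median line height is a "title";
      - a line indented relative to its block's left edge is a "firstline";
      - anything else is a plain "text" line.
    """
    root = ET.fromstring(alto_xml.strip())
    blocks = [el for el in root.iter() if local(el.tag) == "TextBlock"]

    # Median line height over the page, used as the reference size.
    heights = sorted(
        float(line.get("HEIGHT", 0))
        for block in blocks
        for line in block
        if local(line.tag) == "TextLine"
    )
    median = heights[len(heights) // 2] if heights else 0.0

    results = []
    for block in blocks:
        block_left = float(block.get("HPOS", 0))
        for line in block:
            if local(line.tag) != "TextLine":
                continue
            text = " ".join(
                s.get("CONTENT", "") for s in line if local(s.tag) == "String"
            )
            height = float(line.get("HEIGHT", 0))
            indent = float(line.get("HPOS", 0)) - block_left
            if median and height >= title_ratio * median:
                label = "title"
            elif indent >= indent_min:
                label = "firstline"
            else:
                label = "text"
            results.append((text, label))
    return results


# Invented sample page, for illustration only.
SAMPLE = """
<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace>
    <TextBlock ID="b1" HPOS="100" VPOS="50" WIDTH="400" HEIGHT="40">
      <TextLine HPOS="100" VPOS="50" WIDTH="400" HEIGHT="40">
        <String CONTENT="LE"/><String CONTENT="JOURNAL"/>
      </TextLine>
    </TextBlock>
    <TextBlock ID="b2" HPOS="100" VPOS="100" WIDTH="400" HEIGHT="60">
      <TextLine HPOS="120" VPOS="100" WIDTH="380" HEIGHT="20">
        <String CONTENT="Hier"/><String CONTENT="soir"/>
      </TextLine>
      <TextLine HPOS="100" VPOS="120" WIDTH="400" HEIGHT="20">
        <String CONTENT="la"/><String CONTENT="ville"/>
      </TextLine>
      <TextLine HPOS="100" VPOS="140" WIDTH="200" HEIGHT="20">
        <String CONTENT="dormait."/>
      </TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>
"""

for text, label in label_lines(SAMPLE):
    print(f"{label:10s} {text}")
```

Geometric thresholds of this kind would need tuning on a real collection, which is presumably why the authors derive their rules from a study of their dataset rather than fixing them in advance.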
