SegmOnto: A Controlled Vocabulary to Describe and Process Digital Facsimiles

Document record

Date

17 December 2024

Relations

This document is related to:
info:eu-repo/semantics/altIdentifier/doi/10.46298/jdmdh.12689

Collection

Open archives

Licences

http://creativecommons.org/licenses/by/ , info:eu-repo/semantics/OpenAccess




Cite this document

Simon Gabay et al., "SegmOnto: A Controlled Vocabulary to Describe and Process Digital Facsimiles", HAL SHS (Sciences de l’Homme et de la Société), ID: 10.46298/jdmdh.12689


Abstract

Our initiative aims to design a controlled vocabulary for describing the layout of textual sources: SegmOnto. Taking a physical rather than a strictly semantic approach, it is designed as a pragmatic and generic typology that covers most Western historical documents rather than answering project-specific needs. Harmonising layout description has two objectives. First, it facilitates the pooling of annotated data and therefore the training of better models for page segmentation, a crucial preliminary step for text recognition. Second, it allows the development of a shared post-processing pipeline for transforming ALTO or PAGE files into standard DH formats while preserving, as far as possible, the link between the extracted information and the digital facsimile. To demonstrate that SegmOnto can meet both objectives, we aggregate data from multiple projects to train a layout analysis model, and we propose a prototype of a generic pipeline for converting ALTO XML into XML-TEI.
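
To make the ALTO-to-TEI idea more concrete, here is a minimal sketch of one possible conversion step. It is not the authors' pipeline. It assumes ALTO v4 input in which each TextBlock points, through its TAGREFS attribute, to an OtherTag whose LABEL holds a SegmOnto zone name; the sample document, the function name alto_to_tei_facsimile and the fallback label "unlabelled" are hypothetical. Each block becomes a TEI zone whose coordinates keep the link to the facsimile.

```python
import xml.etree.ElementTree as ET

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v4#"
TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical ALTO v4 sample: two blocks tagged with SegmOnto zone labels.
SAMPLE_ALTO = f"""<alto xmlns="{ALTO_NS}">
  <Tags>
    <OtherTag ID="BT1" LABEL="MainZone"/>
    <OtherTag ID="BT2" LABEL="NumberingZone"/>
  </Tags>
  <Layout>
    <Page ID="p1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <TextBlock ID="b1" TAGREFS="BT1" HPOS="200" VPOS="300" WIDTH="1500" HEIGHT="2400"/>
        <TextBlock ID="b2" TAGREFS="BT2" HPOS="1700" VPOS="100" WIDTH="100" HEIGHT="80"/>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def alto_to_tei_facsimile(alto_xml: str) -> ET.Element:
    """Build a TEI <facsimile> with one <zone> per ALTO TextBlock (sketch)."""
    ns = {"a": ALTO_NS}
    root = ET.fromstring(alto_xml)

    # Map tag IDs to their labels (assumed here to be SegmOnto zone names).
    labels = {
        tag.get("ID"): tag.get("LABEL")
        for tag in root.iterfind(".//a:Tags/a:OtherTag", ns)
    }

    facsimile = ET.Element(f"{{{TEI_NS}}}facsimile")
    for page in root.iterfind(".//a:Page", ns):
        surface = ET.SubElement(
            facsimile,
            f"{{{TEI_NS}}}surface",
            {"ulx": "0", "uly": "0",
             "lrx": page.get("WIDTH"), "lry": page.get("HEIGHT")},
        )
        for block in page.iterfind(".//a:TextBlock", ns):
            x = int(float(block.get("HPOS")))
            y = int(float(block.get("VPOS")))
            w = int(float(block.get("WIDTH")))
            h = int(float(block.get("HEIGHT")))
            # TAGREFS may list several IDs; keep the first one with a known label.
            refs = (block.get("TAGREFS") or "").split()
            zone_type = next(
                (labels[r] for r in refs if r in labels), "unlabelled"
            )
            # Bounding box as upper-left / lower-right corners of the zone.
            ET.SubElement(
                surface,
                f"{{{TEI_NS}}}zone",
                {"type": zone_type,
                 "ulx": str(x), "uly": str(y),
                 "lrx": str(x + w), "lry": str(y + h)},
            )
    return facsimile


if __name__ == "__main__":
    ET.register_namespace("", TEI_NS)
    print(ET.tostring(alto_to_tei_facsimile(SAMPLE_ALTO), encoding="unicode"))
```

Because the zone types come straight from the shared vocabulary, the same function would apply unchanged to files from any project that annotates its layout with the same labels, which is the point of harmonising the description.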
