Books of Hours: the First Liturgical Corpus for Text Segmentation

Metadata

Date

May 11, 2020

Collection

Archives ouvertes

License

info:eu-repo/semantics/OpenAccess


Keywords

text segmentation, books of hours, structural scheme, hierarchical segmentation


Cite this document

Amir Hazem et al., « Books of Hours: the First Liturgical Corpus for Text Segmentation », Institut de recherche et d'histoire des textes, ID : 10670/1.ni4xsu



Abstract

The Book of Hours was the bestseller of the late Middle Ages and Renaissance. It is an invaluable historical treasure, documenting the devotional practices of Christians in the late Middle Ages. Up to now, its textual content has been scarcely studied because of its manuscript nature, its length and its complex content. At first glance, it looks too standardized. However, the study of books of hours raises important challenges: (i) in image analysis, its often lavish ornamentation (illegible painted initials, line-fillers, etc.), abbreviated words and multilingualism are difficult to address in Handwritten Text Recognition (HTR); (ii) its hierarchical, entangled structure offers a new field of investigation for text segmentation; (iii) in digital humanities, its textual content gives opportunities for historical analysis. In this paper, we provide the first corpus of books of hours, which consists of Latin transcriptions of 300 books of hours generated by Handwritten Text Recognition (HTR), which is like Optical Character Recognition (OCR) but for handwritten rather than printed texts. We designed a structural scheme of the book of hours and manually annotated two books of hours according to this scheme. Lastly, we performed a systematic evaluation of the main state-of-the-art text segmentation approaches.
