Books of Hours: the First Liturgical Corpus for Text Segmentation



May 11, 2020


Archives ouvertes




text segmentation, books of hours, structural scheme, hierarchical segmentation

Cite this document

Amir Hazem et al., « Books of Hours: the First Liturgical Corpus for Text Segmentation », Institut de recherche et d'histoire des textes, ID : 10670/1.ni4xsu



Abstract

The Book of Hours was the bestseller of the late Middle Ages and Renaissance. It is a historically invaluable treasure, documenting the devotional practices of Christians in the late Middle Ages. Up to now, its textual content has been scarcely studied because of its manuscript nature, its length and its complex content. At first glance, it looks too standardized. However, the study of books of hours raises important challenges: (i) in image analysis, its often lavish ornamentation (illegible painted initials, line-fillers, etc.), abbreviated words, and multilingualism are difficult to address in Handwritten Text Recognition (HTR); (ii) its hierarchical, entangled structure offers a new field of investigation for text segmentation; (iii) in digital humanities, its textual content gives opportunities for historical analysis. In this paper, we provide the first corpus of books of hours, which consists of Latin transcriptions of 300 books of hours generated by HTR, the counterpart of Optical Character Recognition (OCR) for handwritten rather than printed texts. We designed a structural scheme of the book of hours and manually annotated two books of hours according to this scheme. Lastly, we performed a systematic evaluation of the main state-of-the-art text segmentation approache
