17 October 2024
http://creativecommons.org/licenses/by/ , info:eu-repo/semantics/OpenAccess
Sergio Torres Aguilar, « Handwritten Text Recognition for Historical Documents using Visual Language Models and GANs », HALSHS : archive ouverte en Sciences de l’Homme et de la Société, ID : 10670/1.ehs99r
In this study, we focus on Handwritten Text Recognition (HTR) for Medieval and Early Modern documentary manuscripts (10th-16th centuries) using Vision Language Models (VLMs). We leverage the TrOCR architecture and integrate domain-specific large language models (LLMs). This HTR approach shows promising results on contemporary documents, but applying it to historical manuscripts and low-resource languages requires domain pre-trained image models to encode sequential data and adapted LLMs to adequately decode the signals. Furthermore, as training pairs from medieval manuscripts are scarce, a synthetic dataset generated using GAN (Generative Adversarial Network) augmentation techniques is used during training. For this work, the annotated training dataset comprises more than 2 million tokens and 210,000 graphical text lines coming from 52 different manuscripts, mostly in historical forms of four languages: Latin, French, Spanish and High German. The synthetic GAN dataset comprises 420,000 graphical lines emulating textual and graphical features from the ground truth. The results show relative improvements of up to 30% in CER, WER and BERT-score compared to CRNN-only solutions. This study outlines the training architecture and corpora employed, examines the challenges encountered during training and validation concerning ancient writing practices, and analyzes the potential biases and strengths associated with the joint application of vision transformers, GANs and LLMs for HTR tasks.
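The abstract reports its gains as relative improvements in CER and WER without stating how those metrics are computed. As a minimal sketch, assuming the usual Levenshtein-based definitions (edit distance divided by reference length, over characters for CER and over whitespace-separated tokens for WER), the calculation could look like the following. The function names and the sample strings are hypothetical illustrations, and the error rates passed to `relative_improvement` are invented numbers, not results from the study.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference items yields the empty hypothesis prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: token edits divided by the number of reference tokens."""
    ref_tokens = reference.split()
    if not ref_tokens:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref_tokens, hypothesis.split()) / len(ref_tokens)


def relative_improvement(baseline_error, new_error):
    """Fraction by which an error rate drops relative to a baseline."""
    if baseline_error <= 0:
        raise ValueError("baseline error must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical transcription pair with a single misread character.
reference = "in nomine domini amen"
hypothesis = "in nomine donini amen"
print(f"CER = {cer(reference, hypothesis):.4f}")
print(f"WER = {wer(reference, hypothesis):.4f}")

# Invented error rates, used only to show what a 30% relative gain means.
print(f"relative improvement = {relative_improvement(0.10, 0.07):.0%}")
```

Under these definitions, a relative improvement of 30% means the error rate falls to 70% of the baseline value, which is a different quantity from an absolute drop of 30 percentage points.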