Dual-Style Transcription of Historical Manuscripts based on Multimodal Small Language Models with Switchable Adapters

Document record

Date

8 March 2025

Document type
Scope
Language
Identifiers
Collection

Open archives

Licenses

http://creativecommons.org/licenses/by-nc/ , info:eu-repo/semantics/OpenAccess




Cite this document

Sergio Torres Aguilar, "Dual-Style Transcription of Historical Manuscripts based on Multimodal Small Language Models with Switchable Adapters", HAL SHS (Sciences de l’Homme et de la Société), ID: 10670/1.188182...



Abstract

The transcription of historical manuscripts presents unique challenges due to complex writing styles, pervasive abbreviations, and significant linguistic heterogeneity. Recent multimodal architectures that integrate powerful language models with vision encoders offer promising and flexible solutions. In this work, we propose a two-stage training strategy to model two main transcription paradigms: diplomatic (abbreviated) and semi-diplomatic (expanded). Our approach leverages two specialized corpora: CATMuS Medieval (abbreviated) and TRIDIS (non-abbreviated). In Stage 1, we pre-train a single model on a randomly mixed combined dataset, using a loss penalization mechanism to discourage the prediction of Medieval Unicode Font Initiative (MUFI) characters during expanded-text generation. In Stage 2, this pre-trained model serves as the foundation for training separate LoRA adapters, one per transcription style, which can be switched dynamically at inference. We benchmark our approach using MiniCPM-Llama3-V-2.5, Phi-3.5 Vision and Qwen2-VL (all released by the end of 2024) against two established baselines (Kraken v5 and TrOCR). Our results indicate that while the unified “double-head” model provides a solid initialization, fine-tuning separate, switchable LoRA adapters on top of that base yields progressively better performance by adapting more closely to each transcription style. We discuss the limitations of a single adapter and advocate for switched adapters to achieve optimal flexibility. This integrated approach supports historians, linguists, and digital humanities scholars in exploring complex historical corpora at multiple interpretive levels.
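
The abstract gives neither the exact form of the Stage 1 penalty nor the adapter implementation. The following is a minimal pure-Python sketch of the two ideas under stated assumptions: the penalty is taken to be the probability mass placed on MUFI token ids, added to the cross-entropy only for expanded targets, and adapter switching is shown on a single linear layer using the standard LoRA form W x + scale · B (A x). The names `penalized_loss`, `SwitchableLinear`, `mufi_ids`, and the adapter labels are hypothetical and do not come from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def penalized_loss(step_logits, target_ids, mufi_ids, expanded, weight=1.0):
    """Mean cross-entropy; for expanded (semi-diplomatic) targets, add a
    penalty equal to the probability mass on MUFI token ids.
    Assumed formulation, not the one used in the paper."""
    ce = 0.0
    penalty = 0.0
    for logits, target in zip(step_logits, target_ids):
        probs = softmax(logits)
        ce += -math.log(probs[target])
        if expanded:
            penalty += sum(probs[i] for i in mufi_ids)
    return (ce + weight * penalty) / len(target_ids)


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class SwitchableLinear:
    """Frozen base weight plus named low-rank (LoRA-style) updates."""

    def __init__(self, base_weight):
        self.base = base_weight   # shape: d_out x d_in, never modified
        self.adapters = {}        # name -> (A, B, scale)
        self.active = None        # None means "base model only"

    def add_adapter(self, name, a, b, scale=1.0):
        # a has shape r x d_in, b has shape d_out x r
        self.adapters[name] = (a, b, scale)

    def set_adapter(self, name):
        if name is not None and name not in self.adapters:
            raise KeyError(name)
        self.active = name

    def forward(self, x):
        y = matvec(self.base, x)
        if self.active is not None:
            a, b, scale = self.adapters[self.active]
            delta = matvec(b, matvec(a, x))
            y = [yi + scale * di for yi, di in zip(y, delta)]
        return y


# Toy usage: one shared base, two style adapters switched at inference.
layer = SwitchableLinear([[1.0, 0.0], [0.0, 1.0]])
layer.add_adapter("diplomatic", a=[[1.0, 0.0]], b=[[1.0], [0.0]])
layer.add_adapter("semi_diplomatic", a=[[0.0, 1.0]], b=[[0.0], [1.0]])
layer.set_adapter("semi_diplomatic")
print(layer.forward([2.0, 3.0]))
```

In this sketch the base weights are shared and left untouched, so switching styles only changes which low-rank update is added at inference time. In practice the same pattern would be applied across the layers of a vision-language model with an adapter library rather than written by hand.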
