A processing chain for extracting and providing online access to annotated and semantically enriched historical data. The AGODA project

Document record

Date

25 July 2022

Collection

Open archives (Archives ouvertes)

Licence

http://creativecommons.org/licenses/by/




Cite this document

Pierre Vernus et al., "A processing chain for extracting and providing online access to annotated and semantically enriched historical data. The AGODA project", HAL-SHS: histoire, ID: 10670/1.xaews4


Abstract (English)

The AGODA project is one of five pilot projects supported by the DataLab of the Bibliothèque nationale de France. It aims to create an online platform facilitating the exploration and use of the parliamentary debates of the Chamber of Deputies published in the Journal officiel from 1881 to 1940. Within the DataLab framework, we are working on a test subcorpus, the parliamentary cycle from 1889 to 1893, to test our hypotheses on a smaller dataset.

Over the past sixty years, a great deal of work has been done on parliamentary debates, which are a valuable source for historians, political scientists, sociologists and linguists. Access to digitised and OCR-processed debates seems to have a positive effect on the number of historical works using these documents, and the same effect can be observed for other disciplines using contemporary debates. AGODA is thus part of a wider movement to facilitate the use and analysis of parliamentary data, following the example of ParlaClarin and ParlaMint, which propose to produce comparable and multilingual Parliamentary Proceedings Corpora according to the XML-TEI standard. Naomi Truan has also produced a corpus of parliamentary debates encoded in XML-TEI. The production of this type of resource facilitates the publication of works exploiting this data to better understand French political discourse.

Between 1881 and 1899, 2596 issues of the Journal officiel were published (50791 JPG images). The debates are also available in TXT format, but they were put online without extensive post-correction: the quality of the OCR is not sufficient to provide a satisfactory online browsing experience, and it could have a negative impact on the analyses performed on these texts. We therefore chose to run OCR on the text again to obtain a better-quality result. We use the PERO OCR-based solution developed by the SODUCO project.
The OCR output is obtained in JSON format; we are developing Python scripts to convert this output into an XML file conforming to the chosen TEI model. This model is formalised as an adapted XML schema, created using an ODD. We chose the ODD created by ParlaClarin, which can easily be adapted to annotate historical parliamentary debates. In France, the rules for transcribing debates were set in the 19th century; the records of today's debates are therefore very similar to those produced during the Third Republic. The TEI-encoded corpus will be stored in an eXist-db database and displayed using the TEI Publisher application, which can transform the source data into HTML web pages. The parliamentary debates will thus be made available to online users as a digital edition and integrated into an application context.

We will also present the first analyses we have carried out on this corpus with "bag-of-words" techniques, which are not too sensitive to the quality of the OCR. We first used topic modelling, an unsupervised learning method that uncovers the latent semantic structures of a corpus of texts without using semantic and lexical resources. This method is well suited to the study of parliamentary debates. Alternatively, we can use word embeddings to reduce the dimension of the original space from several tens of thousands of forms to a hundred axes, and then apply classical data science tools such as clustering or correlation analysis to the reduced space. Word embedding has thus proved useful for the study of parliamentary debates. We used a continuous bag-of-words model for dimension reduction and an unsupervised classification algorithm, in this case DBSCAN, to group words into clusters.
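
The JSON-to-TEI conversion step mentioned above can be illustrated with a minimal sketch. It is not the project's actual script: the input layout (a `blocks` list with `speaker` and `lines` keys), the function name `ocr_json_to_tei` and the `debateSection` value are assumptions made for illustration, and the real PERO OCR output and the ParlaClarin-based schema are likely richer. The sketch only shows the general idea of mapping OCR text blocks to TEI utterance elements (`<u>` with a `who` attribute).

```python
import json
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# Serialise TEI elements with a default namespace instead of an "ns0:" prefix.
ET.register_namespace("", TEI_NS)


def tei(tag):
    """Return a tag name qualified with the TEI namespace."""
    return f"{{{TEI_NS}}}{tag}"


def ocr_json_to_tei(ocr_json):
    """Convert a (hypothetical) OCR JSON string into a minimal TEI document."""
    data = json.loads(ocr_json)
    root = ET.Element(tei("TEI"))
    body = ET.SubElement(ET.SubElement(root, tei("text")), tei("body"))
    # The "type" value is illustrative only.
    div = ET.SubElement(body, tei("div"), {"type": "debateSection"})
    for block in data["blocks"]:
        # Join the OCR lines of one block into a single utterance.
        text = " ".join(line.strip() for line in block["lines"])
        u = ET.SubElement(div, tei("u"), {"who": "#" + block["speaker"]})
        u.text = text
    return ET.tostring(root, encoding="unicode")


# Assumed input layout; the real OCR output is not described in the abstract.
sample = json.dumps({
    "blocks": [
        {"speaker": "president", "lines": ["La séance est", "ouverte."]},
        {"speaker": "deputy1", "lines": ["Je demande la parole."]},
    ]
})
tei_xml = ocr_json_to_tei(sample)
print(tei_xml)
```

A file produced this way would still need to be validated against the schema generated from the ODD before being loaded into eXist-db.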
