REPUBLIC aims to provide access to the scans, texts and annotation layers of over half a million pages of handwritten and printed political information. To accomplish this, the project combines historical expertise, machine learning and public participation.
1. Digitizing Sources in the National Archives
The enormous archives of the Dutch States General (around 1200 meters) are stored in the Dutch National Archives in The Hague. The resolutions constitute the core of these archives. Nobody knows exactly how many resolutions exist, but it has been estimated that there are around a million of them, bundled in hundreds of hefty volumes. All resolution volumes have been scanned by the National Archives.
2. Automatic Text Recognition
The resolutions of the States General all exist in handwritten form (1576-1796) and partly also in printed form (1703-1796). To prepare the resolutions for research, they first have to be converted to machine-readable text. We do this with state-of-the-art tools for Handwritten Text Recognition (HTR) and Optical Character Recognition (OCR).
For the handwritten resolutions we use the platform Transkribus. The HTR-software is trained specifically on the resolutions by first creating around a thousand pages of manual transcriptions. These are then used as training and evaluation material for the software. The resulting model is used to make automatic transcriptions of the remaining resolutions.
For the recognition of the printed resolutions (OCR) we use the open-source software Tesseract, and train a specialized model on the resolution texts in order to achieve a high confidence level.
3. Revision by the crowd on the platform ‘VeleHanden’
Once the initial text recognition models have been trained, the computer is already able to distinguish many characters in the resolutions, but it still has to learn more. For this, we collaborate with a large group of volunteers on the crowdsourcing platform VeleHanden: experts in paleography and online publication of historical sources who, for different reasons, share their expertise and time in order to get this massive job done. Their corrections are fed back into Transkribus to improve the results, and this cycle continues until the computer is able to do the work by itself, that is, until 98% of the handwritten text is recognized correctly. We estimate that we need about 50,000 manually corrected scans for this.
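The 98% target can be made concrete with a small calculation. The sketch below assumes that "correctly recognized" is measured per character, as one minus the character error rate (edit distance divided by the length of the corrected text). The project text does not state which metric is used, so this is an illustration only, and the two sample lines are invented rather than taken from the resolutions.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Return 1 - CER, where CER = edit distance / length of the reference."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return 1.0 - levenshtein(reference, hypothesis) / len(reference)


TARGET_ACCURACY = 0.98  # the 98% threshold mentioned in the text

# Invented example lines (assumption): a volunteer-corrected line and
# the automatic transcription of the same line.
corrected = "Is ter vergaderinge gelesen de requeste"
automatic = "Is ter vergaderinge gelefen de requeste"

accuracy = character_accuracy(corrected, automatic)
print(f"accuracy={accuracy:.3f}, target met: {accuracy >= TARGET_ACCURACY}")
```

In practice such a score would be computed over a held-out set of corrected pages rather than a single line, so that one difficult line does not dominate the result.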
4. Structure & Indexing
In order to provide even better access to the resolutions of the States General, we structure the vast amount of resulting data. Who is mentioned in the resolutions, which locations and institutions were involved and what are the most common themes (the so-called named entities)? To answer these questions, we divide the text into logical elements such as session days, dates, attendance records and resolutions. This enables us to link the entities to the corresponding texts and thus add more context to the resolutions. Additionally, we use indexes that were created in the time of the States General. We strive to combine the entities in a coherent framework and give insight into combinations of locations, persons, institutions and subjects. We have calculated that there are between two and three million such structural elements in the resolutions, in addition to a plethora of different topics.
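One way to picture the link between logical elements and entities is a small data model with an index on top. The sketch below is hypothetical: the class names, field names and sample values are assumptions made for illustration and do not describe the project's actual data model.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Resolution:
    resolution_id: str
    text: str
    # Each entity is a (type, name) pair, e.g. ("location", "Batavia").
    entities: list = field(default_factory=list)


@dataclass
class Session:
    date: str          # session day as an ISO date string
    attendance: list   # names from the attendance record
    resolutions: list  # the resolutions taken on that day


def build_entity_index(sessions):
    """Map each entity to the (session date, resolution id) pairs mentioning it."""
    index = defaultdict(list)
    for session in sessions:
        for resolution in session.resolutions:
            for entity in resolution.entities:
                index[entity].append((session.date, resolution.resolution_id))
    return dict(index)


# Invented sample data (assumption), only to show the shape of the result.
sessions = [
    Session(
        date="1725-03-01",
        attendance=["Delegate A", "Delegate B"],
        resolutions=[
            Resolution("res-1", "...", [("location", "Batavia"),
                                        ("institution", "Admiralty")]),
            Resolution("res-2", "...", [("location", "Batavia")]),
        ],
    ),
]

index = build_entity_index(sessions)
print(index[("location", "Batavia")])
```

Because every index entry carries both the session date and the resolution identifier, a lookup by entity leads straight back to the text and its context.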
5. Data storage and online publication
All transcriptions are stored and managed in a text repository. The intermediate results of the different steps leading to transcriptions are also saved, so that every version of every text remains available. This text repository can be complemented in a variety of ways (by data managers or by periodically and automatically requesting new material from other systems) and can be used for different purposes and target audiences. In this way, texts can be queried and saved by REPUBLIC team members, by the scientific staff of the KNAW Humanities Cluster and also by other interested researchers who want to get insight into the raw data. In the same manner, the original scans are made available for reuse in an image repository, following the IIIF standards (International Image Interoperability Framework). Finally, we will build a public online environment that allows researchers and other interested parties to explore, search and analyze all digital material. This research environment will present the scans, the texts, metadata, summarized statistics and the named entities.
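For readers unfamiliar with IIIF: under the IIIF Image API, an image or a part of it is requested through a URL of the form `{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}`. The sketch below builds such URLs. The server address and scan identifier are placeholders (assumptions), not the project's real endpoints.

```python
from urllib.parse import quote


def iiif_image_url(base_url, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Build an IIIF Image API request URL.

    Slashes and other reserved characters in the identifier must be
    percent-encoded, hence quote(..., safe="").
    """
    encoded_id = quote(identifier, safe="")
    return (f"{base_url.rstrip('/')}/{encoded_id}/"
            f"{region}/{size}/{rotation}/{quality}.{fmt}")


def iiif_info_url(base_url, identifier):
    """URL of the info.json document that describes the image."""
    return f"{base_url.rstrip('/')}/{quote(identifier, safe='')}/info.json"


# Placeholder server and identifier (assumptions, not real endpoints).
BASE = "https://images.example.org/iiif"
SCAN = "scan/0001"

full_page = iiif_image_url(BASE, SCAN)
# A 600x400 pixel region starting at x=100, y=200, scaled to half size.
detail = iiif_image_url(BASE, SCAN, region="100,200,600,400", size="pct:50")
print(full_page)
print(detail)
print(iiif_info_url(BASE, SCAN))
```

The region parameter is what makes this useful for transcriptions: a single line or paragraph of a scan can be requested by its pixel coordinates without downloading the whole page.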
Facilitating large-scale research in politics and political culture
Only when all resolutions from the period 1576-1796 are digitally available in a coherent format and design can we really start innovative and large-scale research into the political history of the Dutch States General and the Dutch Republic. Research questions that could be addressed with the project’s final results include:
- Questions relating to early modern institutional innovation, political reconstruction, regime change, network formation, political language use and representation.
- Research into the relative position and wealth of the provinces, the competition between the army and the fleet, the importance of the different colonies and the treatment of diverse religious groups.
- Research into long-term developments, such as the changes in the treatment of petitions or the fluctuations in the interaction with surrounding states.
- Research into the division between formal and informal politics, governance, the development of political ceremonies or politics ‘behind the scenes’.
- Serial research into the presence in meetings and commissions, or matters such as economic and military policy.