The amount of digital data is constantly increasing: extensive digitization of paper archives and more active archiving of digital material are creating growing collections of data. Machine-readable data opens up new opportunities for combining and enriching information and for other data processing. However, enrichment allowing us to make full use of mass data requires advanced automation because humans processing the material cannot keep pace with the generation of digital data. Artificial intelligence methods play a key role in the automation of the description and other processing workflows.
In the High-Performance Digitisation project, CSC, the National Archives and the National Library of Finland are jointly tackling the challenge of artificial intelligence and data processing. The purpose of the project is to create a service for memory organizations facilitating the processing of data: absence or inadequacy of metadata and poor search functions make digital material more difficult to use. The aim is to create an intelligent annotation pipeline for semi-automated annotation (adding metadata) and enrichment of archived material, such as newspapers, books and official documents.
The annotation pipeline using artificial intelligence will be implemented in CSC’s supercomputer environment, in which it can be offered as a service to memory organizations or duplicated to memory organizations’ environments. The Annif software, developed by the National Library of Finland, will serve as the automated subject indexing and classification tool in the pipeline. Read more about Annif at annif.org.
The National Library of Finland has provided CSC with material for test use, and CSC has used the material to conduct tests requiring high-performance computing and has recommended workable solutions and new algorithms for Annif. This has allowed the National Library to examine and boost the performance of Annif: the Omikuji algorithms introduced as part of the project have been one factor improving the quality of the suggested subject headings.
A practically oriented report (Proof of Concept) on integrating Annif into the National Library’s existing description processes is also being prepared as part of the project. This is also in line with the National Library’s metadata vision: the description should be based on semi-automated systems that do not replace humans but are intended to speed up the description process. Under the metadata vision, these systems should also be able to learn by using the concepts selected by humans.
The report will contain a review of the existing processes and put forward proposals for Annif’s role in them. The best updating processes of the service, model and vocabulary change management, and access rights management will also be described in the report. The report will serve as a preliminary work plan for the practical commissioning process.
Moreover, CSC and the National Library of Finland will jointly prepare a general description (whitepaper) of the use of machine learning methods in the automated description service of the National Library and other memory organizations. The aim of the National Archives of Finland is to develop methods for processing the material created in mass digitization (for example, through automated text recognition).
The High-Performance Digitisation project is co-funded by the Connecting Europe Facility of the European Union. In addition to using funding provided for the CSC-managed project, the National Archives and the National Library have also drawn on their own funding resources. The project results will also have wider applications in European memory organizations, and the descriptive metadata will meet the Metadata Quality Assurance (MQA) requirements of the European Data Portal.
The aim of the project is to develop an automated subject cataloging service for the National Library of Finland. Annif has been in test use in Osuva (the open publications archive of the University of Vaasa) since spring 2020. The archive is maintained by the National Library. With the integration in place, the submission form works as follows: a student (or researcher or other author) enters the text, which is sent to Annif via an interface. The student can accept or reject the suggestions made by Annif and add their own subject headings or keywords. Both the subject headings suggested by Annif and those selected by users are saved for quality assurance and further processing.
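The review step in the Osuva form can be illustrated with a minimal Python sketch. It is not the actual Osuva or Annif integration code: the class names, the function `review_suggestions`, the example URIs and the scores are all hypothetical, and the suggestion fields (URI, label, score) are only assumed to resemble what an indexing tool such as Annif returns. The sketch shows the idea described above: the author rejects some suggestions, adds their own keywords, and both the original suggestions and the final selection are kept.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Suggestion:
    """One automatically suggested subject heading (hypothetical shape)."""
    uri: str
    label: str
    score: float


@dataclass
class IndexingRecord:
    """What gets saved: the tool's suggestions and the author's final choice."""
    suggested: list[Suggestion]
    selected: list[str] = field(default_factory=list)


def review_suggestions(suggested, rejected_uris, own_keywords):
    """Keep suggestions the author did not reject, then add the author's own keywords."""
    selected = [s.label for s in suggested if s.uri not in rejected_uris]
    for keyword in own_keywords:
        if keyword not in selected:  # skip keywords that duplicate an accepted suggestion
            selected.append(keyword)
    return IndexingRecord(suggested=list(suggested), selected=selected)


# Made-up example data; the URIs and scores are placeholders, not real vocabulary entries.
suggestions = [
    Suggestion("http://example.org/concept/1", "digitization", 0.91),
    Suggestion("http://example.org/concept/2", "newspapers", 0.64),
    Suggestion("http://example.org/concept/3", "photography", 0.31),
]

record = review_suggestions(
    suggestions,
    rejected_uris={"http://example.org/concept/3"},
    own_keywords=["Finnish archives", "newspapers"],
)
print(record.selected)
```

Keeping the rejected suggestions in the saved record, rather than only the final selection, is what makes later quality assurance possible: the two lists can be compared to see where the automated suggestions and the human choices diverge.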
A video presenting Osuva integration is available in Doria. A similar Annif integration has been in use in the University of Jyväskylä’s JYX archive for many years and the feedback on the tool provided by JYX has been positive. Read the presentation given by Ari Häyrinen at Library Network Days in 2019 to learn more about the use of Annif in JYX (pdf, in Finnish).
In the future, Annif will be introduced in other publication archives maintained by the National Library. The National Library has also launched the Finto AI service, which is the Annif version intended for production use. More extensive Annif applications (especially the use of the annotation pipeline designed in this project) will have to be discussed in follow-up projects. Naturally, we would like to continue the successful cooperation built around the project.
The parties involved had not engaged in any cooperation on automated description before this project. However, the project was launched quickly and without problems, which suggests that the topic is highly relevant and that there is a need for cooperation between a wide range of different actors. The project has made rapid progress and is more or less on schedule but, unfortunately, the productization start has been delayed because of the coronavirus pandemic and recruitment challenges. For this reason, it was decided to extend the project until the end of 2020 and the parties are currently exploring how to continue the fruitful cooperation.
Natural language processing (NLP) techniques have made huge progress during the project. OpenAI's GPT neural network model has attracted the most attention, but developments such as the introduction of more advanced BERT models have been more relevant to automated subject cataloging. The accuracy and comprehensiveness of natural language processing methods are expected to keep improving rapidly, and taking part in this process will provide a basis for the development of better automatic processing workflows.
The service will reach an advanced development stage during the project and preliminary results of its use will become available. However, it is clear that the development of automated processes using artificial intelligence is at its initial stages: creating an entirely new operating model is a long process and requires continuous learning over the coming years. Integrating artificial intelligence supporting human activity into the operations of memory organizations and well-established working methods will be a challenging task.
The article in Finnish: High Performance Digitisation -hankkeella vauhtia digitaalisten aineistojen kuvailuun.