In September and October 2024, the CENL “AI in Libraries” Network Group will host webinars on various uses of Artificial Intelligence (AI) in national libraries.
The online events last approximately 45-60 minutes each.
For more information, please see below and/or contact Jean-Philippe Moreux (National Library of France), chair of the group, at jean-philippe.moreux@bnf.fr.
Javier de la Rosa (NLN)
Javier de la Rosa is a Senior Research Scientist at the Artificial Intelligence Lab at the National Library of Norway. A former Postdoctoral Fellow at the UNED Digital Humanities Innovation Lab, he holds a PhD in Hispanic Studies with a specialization in Digital Humanities from the University of Western Ontario, and a Master’s in Artificial Intelligence from the University of Seville. Javier has previously worked as a Research Engineer at the Stanford University Center for Interdisciplinary Digital Research, and as the Technical Lead at the University of Western Ontario CulturePlex Lab for Cultural Complexity. He is interested in Natural Language Processing applied to historical and literary text, with a special focus on large language models.
Abstract: The Mímir Project is an initiative by the Norwegian government that aims to assess the significance and influence of copyrighted materials in the development and performance of generative large language models (LLMs) tailored to the Norwegian languages. This collaborative effort involves three leading institutions from different regions of the country: the National Library of Norway (NB), the University of Oslo (UiO), and the Norwegian University of Science and Technology (NTNU), each contributing unique expertise in language technology, corpus curation, model training, copyright law, and computational linguistics. Additionally, the project has been supported with computational resources provided by Sigma2. The ultimate goal of the project is to gather empirical evidence that will inform the formulation of a compensation scheme for authors whose works are utilized by these advanced artificial intelligence (AI) systems, ensuring that intellectual property rights are respected and adequately compensated.
Monday 16 September 2024, 11:00 am CEST
Per Egil Kummervold (NLN)
Per Egil Kummervold is a Senior Researcher at the National Library of Norway, where he has worked since September 2020. He has been responsible for the Norwegian Transformer Model Project (training the first Norwegian BERT model and creating the Norwegian Colossal Corpus), and then for the Norwegian Speech Transformer Model Project (NOSTRAM), training a Norwegian Whisper model.
Abstract: The NOSTRAM project uses library resources to build a 20,000-hour aligned speech corpus. The corpus comes from a variety of sources and uses a range of transcription styles. We apply a diverse set of techniques to clean the corpus so that it is suitable for ASR training, and are generally able to improve considerably on the OpenAI model. We show that we improve the Norwegian Bokmål transcription of OpenAI Whisper Large-v3 from a WER of 10.4 to 6.6 on the Fleurs dataset, and from 6.8 to 2.2 on the NST dataset.
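For readers unfamiliar with the metric quoted above, word error rate (WER) is the word-level edit distance between a reference transcript and the ASR output, divided by the length of the reference. A minimal illustrative implementation (not the project’s evaluation code) might look like:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Single-row dynamic programming over the edit-distance matrix.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev = d[0]          # value of d[i-1][j-1]
        d[0] = i
        for j, h in enumerate(hyp, 1):
            cur = d[j]
            d[j] = min(d[j] + 1,         # deletion
                       d[j - 1] + 1,     # insertion
                       prev + (r != h))  # substitution (0 if words match)
            prev = cur
    return d[-1] / len(ref)

print(wer("se og hør", "se og hør"))  # -> 0.0
```

A WER of 6.6 thus means roughly one word in fifteen is inserted, deleted, or substituted relative to the reference.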
Lars Mydtskov, Lasse Rogers, Ditte Laursen (RDL)
Abstract: This project aims to leverage the implementation of advanced Automatic Speech Recognition (ASR) systems to enable text-based searches within the Danish Royal Library’s extensive radio and television archives. Based on the Whisper ASR model, we will demonstrate how transcribing audiovisual materials can make these vast archives significantly more accessible and searchable for research.
Established in 1987, the State Media Archive was created to collect and preserve Danish radio and television broadcasts for future historical research. The collection includes nationwide public service broadcasts from the mid-1980s onward, supplemented with older broadcasts. The 2005 legal deposit law revision included radio and television, ensuring comprehensive digital collection from significant nationwide channels, while selectively collecting from others. Today, the digital archive holds over three million broadcasts from around 80 Danish radio and television stations, with 1,500 broadcasts added daily, and efforts are ongoing to digitize over 150,000 analogue tapes from before 2005.
Traditional indexing methods cannot efficiently manage this immense and diverse corpus, because metadata varies significantly between channels and changes over time; some channels, broadcasts, or periods lack metadata altogether. Based on a selective part of the collection, primarily older source materials, this project focuses on fine-tuning the Whisper ASR model for Danish-language content in order to make the audiovisual resources text-searchable for research. The process includes several steps: feature extraction, tokenization, and fine-tuning with Low-Rank Adaptation (LoRA).
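The LoRA step mentioned in the abstract freezes the pretrained weights and trains only a low-rank update. The sketch below illustrates the idea in NumPy with arbitrary example dimensions (not the project’s actual Whisper configuration): the adapted layer computes y = Wx + B(Ax), where W is frozen and only the small matrices A and B are trained, with B zero-initialized so the model starts out unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 768, 8  # hidden size and LoRA rank: illustrative values only
W = rng.standard_normal((d, d))         # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                    # trainable up-projection, zero-init

def lora_forward(x: np.ndarray, scale: float = 1.0) -> np.ndarray:
    # Frozen path plus low-rank update: y = Wx + scale * B(Ax)
    return W @ x + scale * (B @ (A @ x))

full_params = W.size
lora_params = A.size + B.size
print(f"trainable params: {lora_params} vs {full_params} "
      f"({100 * lora_params / full_params:.1f}%)")
```

With these example dimensions the adapter trains about 2% of the parameters of the full weight matrix, which is why LoRA makes fine-tuning large ASR models tractable on modest hardware.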
Thursday 26 September, 11:00 am CEST
Lisa Kluge (DNB)
Lisa Kluge is a research associate at the German National Library. There, she is part of the library’s project investigating methods from artificial intelligence and natural language processing for the subject indexing of digital publications. At Saarland University, she completed a bachelor’s degree in computational linguistics, followed by a master’s degree in language science and technology. Currently, she mainly works with large language models. She is passionate about all aspects of language and how modern technologies can analyse, utilise and reproduce it.
Abstract: With the rise of large language models (LLMs), many tasks of natural language processing (NLP) have reached unprecedented performance levels. One task where LLMs have not yet beaten traditional methods is subject indexing with a large controlled target vocabulary.
In this presentation we will show how a variety of open-source LLMs are applied to the task of subject indexing on a dataset of German book titles, compiled at the German National Library (DNB). The results are compared to one closed-source LLM and two baseline methods that are already in productive use at the DNB. We will discuss the strengths and weaknesses of these approaches.
Thursday 17 October, 11:00 am CEST
Jean-Philippe Moreux (BnF – National Library of France)
Jean-Philippe Moreux is the BnF’s AI head of mission. He works on the BnF’s heritage digitization, digital humanities and AI programs, and participates in national and international research projects on these topics. Prior to that, he was an IT R&D engineer and project manager, and worked as a science editor and a consultant in the publishing industry. He is also the chair of the CENL “AI in Libraries” Network Group and a member of the AI4LAM Council.
Abstract: Taking advantage of AI’s technical advances means pursuing a policy of major projects, whose ambitions and visibility will be success factors. But AI can also provide services on a day-to-day basis, in the tasks and activities commonly encountered in LAMs and carried out by the operational teams of the institutions concerned. These tasks are varied in nature and scope, drawing on multiple skills and taking place in sectors that are sometimes far removed from where the AI expertise is located. What’s more, these operational teams are often keen to improve their digital tooling, which more often than not consists of siloed internal applications and Excel spreadsheets. At the BnF, the desire to bring these operational teams closer to AI led to the launch of an experiment based on:
Thursday 24 October, 11:00 am CEST