CENL Event

28th August 2024

Network Group “AI in Libraries” Webinars 2024

In September 2024, the CENL “AI in Libraries” Network Group will host three webinars on various uses of Artificial Intelligence (AI) in national libraries.

The online events last approximately 45 minutes each.

For more information, please see below or contact the chair of the group, Jean-Philippe Moreux of the National Library of France, at jean-philippe.moreux@bnf.fr.

 

Impact of copyrighted materials in LLMs: Mímir Project

Javier de la Rosa (NLN)

Javier de la Rosa is a Senior Research Scientist at the Artificial Intelligence Lab at the National Library of Norway. A former Postdoctoral Fellow at the UNED Digital Humanities Innovation Lab, he holds a PhD in Hispanic Studies with a specialization in Digital Humanities from the University of Western Ontario, and a Master's in Artificial Intelligence from the University of Seville. Javier has previously worked as a Research Engineer at the Stanford University Center for Interdisciplinary Digital Research, and as the Technical Lead at the University of Western Ontario CulturePlex Lab for Cultural Complexity. He is interested in Natural Language Processing applied to historical and literary text, with a special focus on large language models.

Abstract: The Mímir Project is an initiative by the Norwegian government that aims to assess the significance and influence of copyrighted materials in the development and performance of generative large language models (LLMs) tailored to the Norwegian languages. This collaborative effort involves three leading institutions from different regions of the country: the National Library of Norway (NB), the University of Oslo (UiO), and the Norwegian University of Science and Technology (NTNU); each contributing unique expertise in language technology, corpus curation, model training, copyright law, and computational linguistics. Additionally, the project has been supported with computational resources provided by Sigma2. The ultimate goal of the project is to gather empirical evidence that will inform the formulation of a compensation scheme for authors whose works are utilized by these advanced artificial intelligence (AI) systems, ensuring that intellectual property rights are respected and adequately compensated.

Tuesday 10 September 2024, 11:00 am CEST

Registration: https://bnf-fr.zoom.us/meeting/register/tJcldeqgrD0pGdV1LL6GPbjCc1wC39beEKsX

 

AI for end-users? The BnF experiment program

Jean-Philippe Moreux (BnF – National Library of France)

Jean-Philippe Moreux is the BnF AI head of mission. He works on the BnF heritage digitization, digital humanities and AI programs. He participates in national and international research projects on these topics. Prior to that, he was an IT R&D Engineer and project manager, and worked as a science editor and a consultant in the publishing industry. He is also the CENL “AI in Libraries” Network Group chairman, and a member of the AI4LAM Council.

Abstract: Taking advantage of AI’s technical advances means pursuing a policy of major projects, whose ambitions and visibility will be success factors. But AI can also provide services on a day-to-day basis, in the tasks and activities commonly encountered in LAMs and carried out by the operational teams of the institutions concerned. These tasks are varied in nature and scope, drawing on multiple skills and taking place in sectors that are sometimes far removed from where the AI expertise is located. What’s more, these operational teams are often keen to improve their digital tooling, which more often than not consists of siloed internal applications and Excel. At BnF, the desire to bring these operational teams closer to AI led to the launch of an experiment based on:

  • the use of AI platforms such as Dataiku, Roboflow and Label Studio as a common digital workspace, enabling scattered data to be aggregated, processed and then used to produce new data or provide access to new services
  • the creation of IT/business pairs working on a use case provided by business departments
  • coordination of the experiment by BnF’s AI Unit.

Thursday 19 September 2024, 11:00 am CEST

Registration: https://bnf-fr.zoom.us/meeting/register/tJUtcumtrjgvGNWxfRtrVACVQ3bEqllsLoy5

 

Speech to Text

Per Egil Kummervold (NLN)

Per Egil Kummervold is a Senior Researcher at the National Library of Norway, where he has worked since September 2020. He has been responsible for the Norwegian Transformer Model Project (training the first Norwegian BERT model and creating the Norwegian Colossal Corpus), and then for the Norwegian Speech Transformer Model Project (NOSTRAM) (training a Norwegian Whisper model).

Abstract: The NOSTRAM project uses library resources to build an aligned speech corpus of 20,000 hours. The corpus comes from a variety of sources that use different transcription styles. We use a diverse set of techniques to clean the corpus so that it is suitable for ASR training. Overall, we improve considerably on the OpenAI model. Relative to OpenAI Whisper Large-v3, we reduce the word error rate (WER) for Norwegian Bokmål transcription from 10.4 to 6.6 on the Fleurs dataset and from 6.8 to 2.2 on the NST dataset.
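For readers unfamiliar with the metric, WER is commonly defined as the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and the system output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the project's evaluation code: the abstract does not say how the transcripts were normalized or scored, and the function names and sample sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent, using word-level Levenshtein distance.

    Assumption: words are split on whitespace with no text normalization;
    real ASR evaluations usually normalize case and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Return the relative error reduction in percent."""
    return 100.0 * (before - after) / before


# Hypothetical sample: one substituted word out of six reference words.
sample_wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")

# WER figures quoted in the abstract.
fleurs_gain = relative_reduction(10.4, 6.6)
nst_gain = relative_reduction(6.8, 2.2)
```

Applied to the figures in the abstract, the drop from 10.4 to 6.6 is a relative reduction of about 37%, and the drop from 6.8 to 2.2 is about 68%.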

Ditte Laursen (RDL)

Ditte Laursen: https://dittelaursen.dk/

Abstract: This project aims to use advanced Automatic Speech Recognition (ASR) to enable text-based searches within the Danish Royal Library’s extensive radio and television archives. Based on the Whisper ASR model, we will demonstrate how transcribing audiovisual materials can make these vast archives significantly more accessible and searchable for research.

Established in 1987, the State Media Archive was created to collect and preserve Danish radio and television broadcasts for future historical research. The collection includes nationwide public service broadcasts from the mid-1980s onward, supplemented with older broadcasts. The 2005 legal deposit law revision included radio and television, ensuring comprehensive digital collection from significant nationwide channels, while selectively collecting from others. Today, the digital archive holds over three million broadcasts from around 80 Danish radio and television stations, with 1,500 broadcasts added daily, and efforts are ongoing to digitize over 150,000 analogue tapes from before 2005.

Traditional indexing methods cannot efficiently manage this immense and diverse corpus, because metadata varies significantly between channels and changes over time. Additionally, some channels, broadcasts or periods lack metadata altogether. Based on a selected part of the collection, primarily older material sourced from DR (Danish Broadcasting Corporation), this project focuses on fine-tuning the Whisper ASR model for Danish-language content in order to make the audiovisual resources text-searchable for research. The process includes several steps: feature extraction, tokenizing, and fine-tuning with low-rank adapters (LoRA).
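As background on the last of these steps, a low-rank adapter leaves a pretrained weight matrix W frozen and learns a small update B·A of rank r, so the adapted layer computes W·x + (alpha / r)·B·(A·x). The sketch below illustrates that general idea in plain Python. It is not the project's training code: the abstract gives no rank, scaling factor or target layers, so every name and number here is a hypothetical choice for illustration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """Compute W x + (alpha / r) * B (A x).

    W: frozen pretrained weights, shape d_out x d_in
    A: trainable down-projection, shape r x d_in
    B: trainable up-projection, shape d_out x r
    """
    rank = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_count(d_out, d_in, rank):
    """Number of trainable values in A and B for one adapted matrix."""
    return rank * (d_in + d_out)


# Toy values, chosen only so the arithmetic is easy to follow.
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[1.0, 1.0]]           # rank 1
x = [1.0, 1.0]

# B is conventionally initialized to zero, so the adapted layer starts
# out identical to the pretrained one.
at_init = lora_forward(W, A, [[0.0], [0.0]], x, alpha=2.0)

# After training, B is non-zero and shifts the output.
after_update = lora_forward(W, A, [[1.0], [2.0]], x, alpha=2.0)

# Illustrative size comparison for a 1280 x 1280 matrix with rank 8.
full = 1280 * 1280
adapter = lora_param_count(1280, 1280, 8)
```

In this toy size comparison the adapter trains 20,480 values instead of 1,638,400, which is why the approach is attractive for adapting a large ASR model with limited compute.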

Thursday 26 September 2024, 11:00 am CEST

Registration: https://bnf-fr.zoom.us/meeting/register/tJUtcuqspzouHtyV5xssfjqx7_-tcvt5nhCE

 
