Beáta Megyesi
Stockholm University


Bio

Beáta Megyesi is a Professor of Computational Linguistics at Stockholm University. She completed her studies at Stockholm University and earned her doctorate from the Royal Institute of Technology in 2002. In 2003, she joined Uppsala University, where she advanced from assistant professor to full professor by 2021. Over the course of her academic career, she has authored more than 100 publications and secured nearly 100 million SEK in external funding. She has served as Chair of the Swedish Research Council's Linguistics Review Panel, President of the Northern European Association for Language Technology, and Head of the Department of Linguistics and Philology at Uppsala University. She is currently the Principal Investigator of the DECRYPT project (2018–2025) on historical cryptology, financed by the Swedish Research Council, and leads the DESCRYPT program, Echoes of History: Analysis and Decipherment of Historical Writings, funded by Riksbankens Jubileumsfond (2025–2032).

Talk: Unlocking Hidden Histories: AI and Expert Collaboration in Deciphering Rare Scripts

Manuscripts written in rare or unknown scripts represent a largely untapped reservoir of historical and cultural knowledge, yet their study is frequently sidelined due to the multifaceted challenges they present. These texts, characterized by unique linguistic structures and diverse symbol sets, demand an interdisciplinary approach that spans linguistic analysis, paleography, cryptanalysis, and cultural studies. While recent advancements in artificial intelligence have introduced promising tools for automating tasks such as identification and transcription, the nuanced interpretation and verification of these manuscripts remain firmly in the realm of human expertise. In this talk, I will explore the inherent complexities of working with rare scripts, discuss the current state of automation in manuscript analysis, and argue for the development of hybrid systems that combine AI efficiency with expert intervention. By enabling minimal corrective inputs and adapting models to various handwriting styles and script idiosyncrasies, such systems have the potential to bridge the gap between computational capabilities and the specialized domain knowledge required for meaningful historical interpretation.



Jussi Karlgren
Free researcher


Bio

Jussi Karlgren is an industrial researcher in language technology and AI and a docent of language technology at the University of Helsinki. He works with language models at Silo AI, part of AMD, and serves as an advisor and board member for several companies and public organisations. Currently, he is engaged in quality assessment of generative language models through the ELOQUENT shared task lab and in the OpenEuroLLM and DeployAI projects.

Talk: What are the most sustainable and valuable resources that language technologists should develop for training language models?

The current generation of generative language models exhibits impressive behaviour in many language processing tasks, thanks to their capacity to estimate a probability distribution over linguistic elements by observing linguistic data. These successes have been achieved by training models on very large data sets, which may be difficult to assemble for languages with a smaller digital footprint than the largest ones. As technical development advances, new architectures, memory models, and training processes will emerge, and the amount of data needed for training is likely to change in the near future, making it possible to train models at lower cost. What resources should language technology research focus on to address the likely needs of future generations of representations?



Joshua Wilbur
University of Tartu


Bio

Joshua Wilbur works at the Center for Digital Humanities at the Institute of Estonian and General Linguistics at the University of Tartu. He holds a PhD in general linguistics and published a Grammar of Pite Saami, based on his documentation corpus, in 2014. In the context of a postdoc project at the University of Freiburg, he began working with the Giellatekno research group for Saami language technology in 2016 to implement an NLP infrastructure for Pite Saami, including automatically annotating his spoken-language documentation corpus. He also published a print dictionary of Pite Saami and continues to develop and maintain online digital lexical resources for the language. After three years as a Visiting Lecturer in Digital Humanities, he began his current position as Lecturer in Digital Linguistics in 2023. Joshua is the National Coordinator for CLARIN ERIC in Estonia and the Chair of the Pite Saami Standardization Committee.

Talk: Digitizing Pite Saami: Making the most of limited resources

Pite Saami is a critically endangered Uralic language spoken by only a few dozen individuals originating from areas in and around Arjeplog in Swedish Lapland. Due to the exceptionally small number of native speakers, very little language data is available; nonetheless, the existing language resources are surprisingly diverse, both in digital and in analogue form. In this talk, I will explore the perhaps extraordinary state of Pite Saami language data and digital tools, including how this came about, what potential the data holds in the context of current technological advances, and the challenges involved. In doing so, I hope to provide a starting point for a discussion on both the realities and realistic prospects of developing NLP for seriously under-resourced languages.