Lost in Translation: Repurposing Semantic Similarity Benchmarks for Evaluating Lexical-Semantic Consistency in LLM-Based Machine Translation
Ye, Quin and Bloem, Jelke
We propose and demonstrate a repurposing of the lexical similarity benchmark Multi-SimLex and the SimLex-999 family of resources for assessing the cross-lingual lexical-semantic consistency of multilingual large language models. While originally gathered for evaluating word embedding models, the parallel nature of the word pairs enables their use in machine translation settings. Using a manually verified subset of 500 word pairs from the Multi-SimLex dataset, we evaluate models’ ability to assess semantic similarity and perform translation between English and Mandarin through zero-shot prompting. We compare BLOOMZ’s and GPT-4’s similarity ratings against human-annotated benchmarks and examine translation consistency using our proposed metric alongside existing ones, with GPT-4 showing stronger alignment with human judgments. As SimLex-999 and Multi-SimLex together cover at least 25 languages, this approach can be extended to many language pairs, including ones that do not involve English, though it requires some manual checking.
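The comparison of model-elicited similarity ratings against human annotations described above is conventionally scored with Spearman rank correlation. The sketch below is illustrative only, not the authors' code, and the ratings are invented rather than drawn from Multi-SimLex:

```python
# Hedged sketch: Spearman correlation between model and human similarity
# ratings, as is standard for SimLex-style evaluations. All data invented.

def rankdata(values):
    """Assign average ranks to values (ties share the mean rank)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the ranks."""
    rx, ry = rankdata(xs), rankdata(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

human = [9.2, 1.3, 6.8, 4.0, 8.1]   # human similarity ratings (0-10 scale)
model = [8.7, 2.0, 4.4, 5.9, 7.5]   # LLM-elicited ratings for the same pairs
print(round(spearman(human, model), 3))  # 0.9
```

In practice one would use `scipy.stats.spearmanr`; the hand-rolled version above just makes the rank-based logic explicit.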
Bridging the Low Resource Gap in Historical Cryptology: A Multilingual Diachronic Synthetic Dataset for Reproducible Cryptanalysis
Bruton, Micaella and Beloucif, Meriem and Megyesi, Beáta
Many NLP tasks suffer from limited aligned supervision in the target domain. Historical cipher decryption represents an extreme case: aligned plaintext–ciphertext pairs are scarce, access to decrypted archives is restricted, and prior work often relies on synthetic data that is neither released nor evaluated for realism. This limits reproducibility and obscures whether models trained on synthetic benchmarks transfer to archival conditions. We introduce HistCiph, the first publicly available multilingual collection of historically grounded plaintext–ciphertext datasets for classical ciphers. Spanning ten languages (Czech, Dutch, English, French, Hungarian, Icelandic, Italian, Polish, Spanish, Swedish) and multiple centuries, the collection combines diachronically balanced historical plaintext with independently generated homophonic substitution keys and controlled transcription noise. Synthetic generation is explicitly constrained by documented properties of historical ciphers, including multi-homophone allocation and variable-length codes. We validate the datasets using information-theoretic diagnostics—entropy, redundancy, frequency masking, and unicity distance—showing that ciphertext distributions approach theoretical bounds while preserving cross-linguistic variation. HistCiph provides a reproducible benchmark for neural decryption and alignment, and illustrates a principled framework for empirically grounded synthetic data generation in low-resource NLP.
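The information-theoretic diagnostics named above have compact standard definitions. The sketch below (not the paper's code) computes Shannon entropy, redundancy, and Shannon's unicity distance for a toy simple-substitution setting; the 3.2 bits/letter figure is the commonly cited redundancy estimate for English:

```python
# Hedged sketch of the diagnostics mentioned in the abstract, applied to a
# toy setting; not the HistCiph pipeline itself.
import math
from collections import Counter

def shannon_entropy(text):
    """Entropy H (bits/symbol) of the symbol distribution in `text`."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def redundancy(text, alphabet_size):
    """Redundancy D = log2(|A|) - H: distance below the maximum-entropy bound."""
    return math.log2(alphabet_size) - shannon_entropy(text)

def unicity_distance(key_space_bits, redundancy_bits):
    """Shannon's unicity distance U = H(K) / D: the ciphertext length at
    which the key becomes (in theory) uniquely determined."""
    return key_space_bits / redundancy_bits

# Toy example: 26-letter alphabet, simple-substitution key space of 26! keys.
key_bits = math.log2(math.factorial(26))  # ~88.4 bits
d_english = 3.2                           # commonly cited redundancy of English
print(round(unicity_distance(key_bits, d_english), 1))  # ≈ 27.6 letters
```

Homophonic ciphers with multi-homophone allocation, as in the dataset, have far larger key spaces and flatter ciphertext distributions, which is exactly what the frequency-masking diagnostic probes.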
Cultural Grounding in Swedish: Extending an Everyday Knowledge Benchmark for LLMs
Beloucif, Meriem and Sjons, Johan
Benchmarks for evaluating Large Language Models (LLMs) on everyday knowledge across cultures and languages are increasingly used to assess cultural competence and contextual understanding. However, many multilingual extensions rely primarily on translated question–answer pairs, limiting their ability to capture locally grounded variation. In this work, we present a Swedish extension of an existing cross-cultural everyday knowledge benchmark, in which questions are translated into Swedish and answers are collected individually from five participants with diverse social and professional backgrounds. This design enables us to capture situated, naturally produced responses from a specific participant group rather than transferred or translated answer templates. We document the translation protocol, participants, and agreement analysis, and examine variation across participants as a signal of culturally contingent knowledge. We evaluate several state-of-the-art multilingual and instruction-tuned LLMs against the aggregated human responses and analyze model performance. Our results reveal that while models often approximate prototypical answers, they struggle with culturally specific nuances and intra-cultural variation. The Swedish extension provides a resource for studying culturally grounded evaluation and highlights the importance of human-generated local answers when benchmarking LLMs across languages.
Entity Linking for Faroese Using Large Language Models with Web Search
Simonsen, Annika and Debess, Iben Nyholm and Einarsson, Hafsteinn
Entity linking connects text mentions to knowledge bases. For low-resource languages, entity linking has typically not been a research priority, as named entity recognition and knowledge base creation must first be addressed. We present the first study of entity linking for Faroese, a North Germanic language with approximately 70,000 speakers. Unlike traditional systems that rely on separate candidate retrieval and ranking components, we employ an end-to-end approach using GPT-5 with integrated web search. Our method prompts the model to directly identify and link named entities to Wikipedia pages through a three-tier fallback strategy: Faroese Wikipedia, English Wikipedia, and finally any available Wikipedia. We evaluate our approach on 1,010 manually annotated examples from a Faroese NER dataset, analyzing entity mentions across Person, Location, Organization, and Miscellaneous types. Human evaluation shows our system achieves 87.5% precision and 87.3% recall, with particularly strong performance on locations (93–95% precision, 92–95% recall). Persons are more challenging (86–88% precision, 72–83% recall). The majority of links (76.5%) point to Faroese Wikipedia, demonstrating the model’s ability to leverage language-specific knowledge bases. A Wikipedia API search baseline without any LLM achieves F1 = 0.57–0.60 on the same evaluation data, confirming that the LLM’s contextual reasoning provides substantial gains over simple search. We validate our approach across three models (GPT-5, Gemini 3 Flash, GPT-5.4 Mini), achieving F1 scores of 0.74–0.87 and confirming that the method generalizes across providers. This work establishes initial performance benchmarks for Faroese entity linking and demonstrates the viability of LLM-based approaches for low-resource languages.
From Polyester Girlfriends to Blind Mice: Creating the First Pragmatics Understanding Benchmarks for Slovene
Brglez, Mojca and Vintar, Špela
Large language models are demonstrating increasing capabilities, excelling at benchmarks once considered very difficult. As their capabilities grow, there is a need for more challenging evaluations that go beyond surface-level linguistic competence. The latter involves not only syntax and semantics but also pragmatics, i.e., understanding situational meaning shaped by context and linguistic and cultural norms. To contribute to this line of research, we introduce SloPragEval and SloPragMega, the first pragmatics understanding benchmarks for Slovene, comprising 405 multiple-choice questions. We discuss the difficulties of translation, describe the campaign to establish a human baseline, and report pilot evaluations with LLMs. Our results indicate that current models have substantially improved in their understanding of nuanced language but may still fail to infer implied speaker meaning in non-literal utterances, especially those that are culture-specific. We also observe a significant gap between proprietary and open-source models. Finally, we argue that benchmarks targeting nuanced language understanding and knowledge of the target culture must be designed with care, preferably constructed from native data, and validated with human responses.
SdQuAD: A Benchmark Question Answering Dataset for Low-resource Sindhi Language
Ali, Wazir and Rafay, Muhammad and Ali, Nadia and Rehman, Amar
Question answering (QA) datasets are crucial for developing and evaluating monolingual and multilingual language models, yet low-resource languages like Sindhi lack open-source QA resources. We introduce SdQuAD, a novel open-source textual QA dataset for the low-resource Sindhi language, comprising more than 14K QA pairs curated and annotated by native speakers using Label Studio. Sourced from diverse domains, including news, history, science, geography, business, and tourism, SdQuAD supports both extractive and abstractive QA tasks while capturing Sindhi’s linguistic diversity. We assess annotation quality using span-level agreement and evaluate extractive performance with Exact Match (EM), F1 score, and a TF-IDF baseline. Additionally, we fine-tune mBERT, XLM-RoBERTa, and mT5 models on SdQuAD, benchmarking their performance to demonstrate the dataset’s utility.
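The extractive metrics mentioned above (EM and token-level F1) are the standard SQuAD-style measures. A minimal sketch, omitting the usual answer normalization (lowercasing, punctuation and article stripping), with invented answers:

```python
# Hedged illustration of SQuAD-style extractive QA scoring; not the
# authors' evaluation script, and without answer normalization.
from collections import Counter

def exact_match(prediction, gold):
    """1 if the predicted span matches the gold span exactly, else 0."""
    return int(prediction.strip() == gold.strip())

def token_f1(prediction, gold):
    """Harmonic mean of token-level precision and recall."""
    pred_toks, gold_toks = prediction.split(), gold.split()
    common = Counter(pred_toks) & Counter(gold_toks)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(exact_match("1947", "1947"))                            # 1
print(round(token_f1("the Indus river", "Indus river"), 2))   # 0.8
```

Over a dataset, EM and F1 are averaged across all QA pairs, typically taking the maximum score over multiple gold answers per question.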
LLMs as Assistants for Data Annotation: Addressing Disagreement and Supporting Expert Processes
Andrade, Mark and Hefernan, Bláithín and Walsh, Abigail and Castilho, Sheila
This paper investigates the potential of Large Language Models to assist human annotation pipelines, with a particular focus on supporting the development of expert-informed annotation guidelines for document-level content categorisation. We present three experiments exploring distinct roles for LLMs in annotation: as annotators, as domain experts assisting in disagreement resolution, and as analysts of annotator discussions. Using GPT-4.5 and Claude Sonnet 4, we evaluate LLM-generated annotation guidelines for a document-level classification task in terms of coverage, applicability, and usefulness. Preliminary results are mixed-to-positive, with evidence that LLMs can provide useful support across different stages of the annotation pipeline, particularly when supplied with rich contextual information such as prior human annotations and annotator discussions.
Annotation Quality in Aspect-Based Sentiment Analysis: A Case Study Comparing Experts, Students, Crowdworkers, and Large Language Models
Donhauser, Niklas and Fehle, Jakob and Hellwig, Nils Constantin and Weinberger, Markus and Kruschwitz, Udo and Wolff, Christian
Aspect-Based Sentiment Analysis (ABSA) enables fine-grained opinion analysis by identifying sentiments toward specific aspects or targets within a text. While ABSA has been widely studied for English, research on other languages such as German remains limited, largely due to the lack of high-quality annotated datasets. This paper examines how different annotation sources influence the development of German ABSA. To this end, an existing dataset is re-annotated by experts to establish a ground truth, which serves as a reference for evaluating annotations produced by students, crowdworkers, Large Language Models (LLMs), and experts. Annotation quality is compared using Inter-Annotator Agreement (IAA) and its impact on downstream model performance for different ABSA subtasks. The evaluation focuses on Aspect Category Sentiment Analysis (ACSA) and Target Aspect Sentiment Detection (TASD). We apply State-of-the-Art (SOTA) methods for ABSA, including BERT-, T5-, and LLaMA-based approaches, spanning fine-tuning and in-context learning with instruction prompts, to assess performance differences. The findings provide practical insights into trade-offs between annotation reliability and efficiency, offering guidance for dataset construction in under-resourced Natural Language Processing (NLP) scenarios.
Posters
Cross-Lingual Mathematical Reasoning in LLMs: Evaluating Performance on Icelandic vs. English Problems
Einarsson, Hafsteinn
We investigate whether large language models (LLMs) exhibit performance differences when solving mathematical problems presented in a low-resource language (Icelandic) versus a high-resource language (English). Using 847 multiple-choice problems from the Icelandic Mathematics Competition corpus (STAK), we evaluate two state-of-the-art models (Gemini-3-Flash-Preview and GPT-5.4-mini) in both multiple-choice (MC) and open-ended (OE) formats, with correctness determined by a three-judge quorum (Gemini-3-Flash, GPT-5.4-mini, Claude Sonnet 4.6) achieving 97.6% unanimous agreement. Our results reveal significant cross-lingual performance gaps that vary by model: Gemini-3-Flash shows a consistent English advantage of 2.4–10.0 percentage points across both evaluation modes, while GPT-5.4-mini exhibits no significant language effects. Notably, GPT-5.4-mini demonstrates a substantial MC deficit, achieving only 42% in that format despite reaching 69–71% accuracy on OE problems. Analysis of answer patterns reveals a strong option position bias in GPT-5.4-mini, with systematic over-selection of option B and under-selection of option D. These findings suggest that language does affect LLM mathematical reasoning for some models, but the effect is model-dependent and interacts with evaluation format, with implications for deploying LLMs in educational contexts for speakers of low-resource languages.
Struct2Unstruct: Creating Tender NER Datasets from Structured Procurement Records using Large Language Models
Abbas, Asim and Lee, Mark and Shanavas, Niloofer and Kovatchev, Venelin and Ali, Mubashir
Named Entity Recognition (NER) in the tender and procurement domain is critical for tasks such as contract monitoring, supplier analysis, and compliance tracking. However, unlike general-purpose NER, no open-source datasets exist for Tender NER, largely due to data sensitivity and confidentiality restrictions. This scarcity limits the development of automated entity extraction models. To address this gap, we propose struct2unstruct, a data preparation pipeline that generates and annotates tender-specific datasets using large language models (LLMs). Starting from structured procurement data published in English by the Singapore government (2015–2021), we employ Llama-3 to generate synthetic tender narratives in multiple writing styles, ensuring each contains at least one tender-related entity. Post-processing steps correct inconsistencies in dates, symbols, and entity formats. Entities are then annotated using a BIO tagging scheme through deterministic alignment with structured fields, followed by expert validation to ensure accuracy. This study focuses on data preparation and evaluation, not model training. The resulting dataset provides a scalable resource for future Tender NER research in low-resource environments. By releasing both the dataset and pipeline as open-source resources, we establish a foundation for advancing domain-adapted information extraction and automated tender entity recognition.
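The deterministic BIO alignment step described above can be sketched as a greedy exact match of structured field values against the generated narrative. The field names and sentence below are invented for illustration; this is a simplified sketch, not the struct2unstruct pipeline:

```python
# Hedged sketch of BIO tagging via deterministic alignment of structured
# field values with a generated narrative. All field names invented.

def bio_tag(tokens, fields):
    """fields: {label: entity string}; greedy exact match over token spans."""
    tags = ["O"] * len(tokens)
    for label, value in fields.items():
        span = value.split()
        for i in range(len(tokens) - len(span) + 1):
            window = tokens[i:i + len(span)]
            # tag the first untagged occurrence of the field value
            if window == span and all(t == "O" for t in tags[i:i + len(span)]):
                tags[i] = f"B-{label}"
                for j in range(i + 1, i + len(span)):
                    tags[j] = f"I-{label}"
                break
    return tags

tokens = "The tender was awarded to Acme Pte Ltd for 50000 SGD".split()
fields = {"SUPPLIER": "Acme Pte Ltd", "AMOUNT": "50000 SGD"}
print(list(zip(tokens, bio_tag(tokens, fields))))
```

A real pipeline would also need the post-processing the abstract mentions (normalizing dates, symbols, and entity formats) before alignment, since exact string match fails on surface variants.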
Link Prediction for Event Logs in the Process Industry
Zhukova, Anastasia and Walton, Thomas and Lobmüller, Christian E. and Gipp, Bela
In the era of graph-based retrieval-augmented generation (RAG), link prediction is a significant preprocessing step for improving the quality of fragmented or incomplete domain-specific data for graph retrieval. Knowledge management in the process industry uses RAG-based applications to optimize operations, ensure safety, and facilitate continuous improvement by effectively leveraging operational data and past insights. A key challenge in this domain is the fragmented nature of event logs in shift books, where related records are often kept separate, even though they belong to a single event or process. This fragmentation hinders the recommendation of previously implemented solutions to users, which is crucial for timely problem-solving at live production sites. To address this problem, we develop a record linking model, which we define as a cross-document coreference resolution (CDCR) task. Record linking adapts the task definition of CDCR and combines two state-of-the-art CDCR models with the principles of natural language inference (NLI) and semantic text similarity (STS) to perform link prediction. The evaluation shows that our record linking model outperformed the best versions of our baselines, i.e., NLI and STS, by 28% (11.43 p) and 27.4% (11.21 p), respectively. Our work demonstrates that common NLP tasks can be combined and adapted to a domain-specific setting of the German process industry, improving data quality and connectivity in shift logs.
MultiZebraLogic: A Multilingual Logical Reasoning Benchmark
Bruun, Sofie Helene and Smart, Dan Saattrup
We create high-quality datasets for LLM evaluation of logical reasoning skills across nine different languages, which have been manually checked by fluent speakers. The datasets consist of so-called zebra puzzles, and we analyse different ways of tuning the difficulty of the puzzles to fit modern LLMs. This includes the size of the puzzle (number of objects and number of clues), as well as a novel addition of red herring clues containing only irrelevant information. We show that the presence of red herrings indeed makes the puzzles significantly harder for the models, and we find puzzle sizes 2×3 and 4×5 are sufficiently challenging for GPT-4o mini (a non-reasoning model) and o3-mini (a reasoning model), respectively. We analyse whether LLM performance on these puzzles is sensitive to the language, the cultural sensitivity of the puzzle theme, and the choice of clue types. These analyses are conducted with English and Danish, where we show that there is no significant difference for any of these three aspects, at least for the OpenAI models GPT-4o mini and o3-mini, chosen as representative non-reasoning and reasoning models, respectively. We publish the datasets for each of the nine languages for the identified sizes 2×3 and 4×5. We also publish the code used to generate the puzzles, which can be used to extend the benchmark to more languages.
Progressing beyond Art Masterpieces or Touristic Clichés: how to assess your LLMs for cultural alignment?
Branco, António and Silva, João and Marques, Nuno and Gomes, Luis and Campos, Ricardo and Sequeira, Raquel and Nerea, Sara and Silva, Rodrigo and Marques, Miguel and Duarte, Rodrigo and Putyato, Artur and Folques, Diogo and Valente, Tiago
Although the cultural (mis)alignment of Large Language Models (LLMs) has attracted increasing attention—often framed in terms of cultural bias—until recently there has been limited work on the design and development of datasets for cultural assessment. Here, we review existing approaches to such datasets and identify their main limitations. To address these issues, we propose design guidelines for annotators and report on the construction of a dataset built according to these principles. We further present a series of contrastive experiments conducted with this dataset. The results demonstrate that our design yields test sets with greater discriminative power, effectively distinguishing between models specialized for a given culture and those that are not, ceteris paribus.
Evaluating Large Language Model-based Natural Language Generation for Modular Dialog systems
Emmerling, Vincent and Kowalski, Christoph and Robrecht-Hilbig, Amelie and Kopp, Stefan
While many dialogue systems currently use end-to-end solutions, modular systems offer greater control, sustainability, and more human-like dialogue. This makes them relevant, especially when aiming to study human behavior patterns in interactions or applying them to sensitive domains. In this paper, we develop an automated metric to measure the quality of an LLM-based NLG component in a modular system, based on hallucination tendency and linguistic quality. We apply the metric to various language models and usage techniques and, based on the results, discuss the conditions a model must meet to be a good candidate for an NLG component in a real-time capable dialogue system. Although such automated metrics cannot replace a real interaction study, they help to compare potential approaches for the individual modules. Therefore, they are indispensable when developing and testing modules in isolation. One advantage of the introduced metric is that it is developed and tested on a German dataset, highlighting challenges when working with languages other than English and discrepancies with the abilities of Generative AI assumed in current state-of-the-art literature.
JobResQA: Semi-Automatic Multilingual Benchmark Creation for LLM Machine Reading Comprehension on Résumés and Job Descriptions
Carrino, Casimiro Pio and Estrella, Paula and Zbib, Rabih and Escolano, Carlos and Fonollosa, José A. R.
We present a methodology for building privacy-preserving multilingual QA benchmarks in low-resource and sensitive domains, demonstrated through JobResQA, a multilingual MRC benchmark over synthetic HR documents. The dataset comprises 581 QA pairs across 105 synthetic résumé-job description pairs in five languages (English, Spanish, Italian, German, and Chinese), with questions spanning four types based on document source (intra vs. cross-document) and reasoning complexity (single-hop vs. multi-hop). We propose a synthetic data pipeline with built-in anonymization and controlled attributes (via placeholders) to enable future fairness studies. Our cost-effective, human-in-the-loop translation pipeline based on the TEaR methodology incorporates MQM error annotations and selective post-editing. Baseline evaluations across multiple open-weight LLM families using LLM-as-judge reveal higher performance on English and Spanish but substantial degradation for other languages, highlighting critical cross-lingual MRC gaps. Our pipeline, where LLMs act as synthesizers, translators, and evaluators under human oversight, constitutes a reusable methodology for resource creation and a case study in the evaluation-integrity challenges of LLM-era benchmark construction.
Beyond English and Evasion: A Human-Annotated Multi-Domain Benchmark for High-Stakes LLM Safety Evaluation in Chinese
Zaghouani, Wajdi and Aldous, Kholoud K. and Gao, Yicheng
When Large Language Models (LLMs) are deployed in Chinese-language settings, a troubling pattern emerges: safety systems that work well in English break down. These systems struggle to cross linguistic and cultural boundaries, leaving models exposed to adversarial prompts that exploit Chinese-specific evasion techniques, including Pinyin romanization, character decomposition, internet slang, and hedging tone. To address this gap, we introduce ChiSafe-PAS (Chinese Safety Pilot Annotation Set), a human-annotated benchmark of 1,897 adversarial Chinese prompts spanning four high-stakes domains: self-harm and violence, drug and illicit trade, fraud, and satire. Of these, 1,544 entries carry complete gold-standard annotations: a 3-class response label (REFUSE, SAFE-REDIRECT, RESPOND), a nine-category obfuscation taxonomy, a risk-level rating, and annotator rationale. We describe the dataset design, annotation process, and obfuscation taxonomy in detail. Our primary goal is practical: to give the research community a high-quality, culturally grounded resource for benchmarking LLM safety alignment. In doing so, we engage three broader tensions in the field: the blurring boundary between training and evaluation data, the need for domain coverage grounded in real-world risk, and the limits of scale as a substitute for cultural expertise.
Evaluating the Impact of LLM-Assisted Annotation in a Perspectivized Setting: the Case of FrameNet Annotation
Belcavello, Frederico and Matos, Ely and Lorenzi, Arthur and Bonoto, Lisandra and Ruiz, Lívia and Pereira, Luiz Fernando and Herbst, Victor and Navarro, Yulla and Abreu, Helen de Andrade and Dutra, Lívia and Torrent, Tiago Timponi
The use of LLM-based applications as a means to accelerate and/or substitute human labor in the creation of language resources and datasets is a reality. Nonetheless, despite the potential of such tools for linguistic research, an evaluation of their performance and impact on the creation of annotated datasets, especially under a perspectivized approach to NLP, is still missing. This paper contributes to the reduction of this gap by reporting on an extensive evaluation of the (semi-)automatization of FrameNet-like semantic annotation by the use of an LLM-based semantic role labeler. The methodology employed compares annotation time, coverage, and diversity in three experimental settings: manual, automatic, and semi-automatic annotation. Results show that the hybrid, semi-automatic annotation setting leads to increased frame diversity and similar annotation coverage, when compared to the human-only setting, while the automatic setting performs considerably worse in all metrics, except for annotation time, which remains similar.
A multilingual hallucination benchmark: MultiWikiQHalluA
Thoresen, Freja and Smart, Dan Saattrup
Most hallucination evaluations focus on English, leaving it unclear whether findings transfer to lower-resource languages. We investigate faithfulness hallucinations, defined as model-generated content that is fluent and plausible but diverges from the provided input or is internally inconsistent. Leveraging the multilingual MultiWikiQA dataset, we utilize the LettuceDetect framework to create synthetic hallucination datasets for 306 languages, from which we train token-level hallucination classifiers for 30 European languages. In this work, we present evaluations of model hallucinations on a selection of languages: English, Danish, German, and Icelandic. Using these classifiers, we evaluate the hallucination rates for Qwen3-0.6B, Qwen3-14B, Gemma-3-12B-IT, cogito-v1-preview-qwen-32B, and cogito-v1-preview-llama-70B. Our classifiers reveal notably higher hallucination rates for Qwen3-0.6B (up to 60% of answers containing at least one hallucination, peaking in Icelandic) and generally lower rates for larger models, with cogito-v1-preview-qwen-32B and cogito-v1-preview-llama-70B performing best on most languages. Hallucination rates are consistently higher for lower-resource languages, particularly Icelandic.
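Turning a token-level hallucination classifier's output into the answer-level rates reported above amounts to counting answers with at least one flagged token. A minimal sketch with invented label sequences (not the LettuceDetect framework itself):

```python
# Hedged sketch: aggregating per-token hallucination labels into the
# answer-level hallucination rate mentioned in the abstract. Labels invented.

def hallucination_rate(answers):
    """answers: list of per-token 0/1 label sequences (1 = hallucinated).
    Returns the fraction of answers containing at least one flagged token."""
    flagged = sum(1 for labels in answers if any(labels))
    return flagged / len(answers)

answers = [
    [0, 0, 0, 0],     # fully supported answer
    [0, 1, 1, 0, 0],  # contains a hallucinated span
    [0, 0],           # supported
    [1, 0, 0],        # hallucinated first token
]
print(hallucination_rate(answers))  # 0.5
```

Answer-level aggregation like this is deliberately strict: a single unsupported token marks the whole answer, which is why small models can reach rates as high as the 60% reported for Qwen3-0.6B.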
Exploring the similarities and differences between VLM-driven and traditional OCR for Historical Swedish Data
Johansson, Martin and Waginder, Selma and Dannélls, Dana
Recent Swedish OCR efforts rely primarily on traditional OCR methods, including deep CNN–LSTM hybrid neural networks and transformer-based models. Some approaches have also demonstrated the applicability of VLM-driven OCR to historical material. However, to date, no studies have examined in depth the performance of VLM-based OCR on historical Swedish sources. In this paper, we ask: How do transformers and VLMs differ in character- and word-level recognition performance across typefaces, and what qualitative differences can be observed in their error patterns? We show that fine-tuned versions of the Alibaba Cloud Qwen3-VL-8B-Instruct and Qwen3-VL-2B-Instruct, combined with a simple repetition-trimming step, outperform conventional OCR systems. Remaining errors are primarily attributable to challenges associated with the Blackletter typeface and formatting issues, such as missing or extra line breaks, characters, and spaces. Even when characters are correctly recognized, formatting inconsistencies can substantially increase transcription error rates.