Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

UniversalCEFR: Enabling open multilingual research on language proficiency assessment

Imperial, Joseph Marvin, Barayan, Abdullah, Stodden, Regina, Wilkens, Rodrigo, Muñoz Sánchez, Ricardo, Gao, Lingyun, Torgbi, Melisa, Knight, Dawn ORCID: https://orcid.org/0000-0002-4745-6502, Forey, Gail, Jablonkai, Reka R., Kochmar, Ekaterina, Reynolds, Robert, Ribeiro, Eugénio, Saggion, Horacio, Volodina, Elena, Vajjala, Sowmya, François, Thomas, Alva-Manchego, Fernando and Madabushi, Harish Tayyar 2025. UniversalCEFR: Enabling open multilingual research on language proficiency assessment. Presented at: Empirical Methods in Natural Language Processing (EMNLP 2025), Suzhou, China, 4-11 November 2025. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pp. 9703-9755. 10.18653/v1/2025.emnlp-main.491

PDF - Published Version
Available under License Creative Commons Attribution.

Abstract

We introduce UniversalCEFR, a large-scale multilingual multidimensional dataset of texts annotated according to the CEFR (Common European Framework of Reference) scale in 13 languages. To enable open research in both automated readability and language proficiency assessment, UniversalCEFR comprises 505,807 CEFR-labeled texts curated from educational and learner-oriented resources, standardized into a unified data format to support consistent processing, analysis, and modeling across tasks and languages. To demonstrate its utility, we conduct benchmark experiments using three modelling paradigms: a) linguistic feature-based classification, b) fine-tuning pre-trained LLMs, and c) descriptor-based prompting of instruction-tuned LLMs. Our results further support using linguistic features and fine-tuning pretrained models in multilingual CEFR level assessment. Overall, UniversalCEFR aims to establish best practices in data distribution in language proficiency research by standardising dataset formats and promoting their accessibility to the global research community.

Item Type: Conference or Workshop Item - published (Paper)
Date Type: Published Online
Status: Published
Schools: Schools > English, Communication and Philosophy
Schools > Computer Science & Informatics
Publisher: Association for Computational Linguistics
ISBN: 9798891763326
Date of First Compliant Deposit: 10 September 2025
Last Modified: 06 Feb 2026 14:53
URI: https://orca.cardiff.ac.uk/id/eprint/180928