Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

Creating Welsh language word embeddings

Corcoran, Padraig, Palmer, Geraint, Arman, Laura, Knight, Dawn and Spasic, Irena 2021. Creating Welsh language word embeddings. Applied Sciences 11 (15), 6896. doi:10.3390/app11156896

PDF - Published Version
Available under License Creative Commons Attribution.

Abstract

Word embeddings are representations of words in a vector space that models semantic relationships between words by means of distance and direction. In this study, we adapted two existing methods, word2vec and fastText, to automatically learn Welsh word embeddings, taking into account the syntactic and morphological idiosyncrasies of the language. These methods exploit the principles of distributional semantics and therefore require a large training corpus. However, Welsh is a minoritised language, so significantly less Welsh language data are publicly available than for English, and assembling a sufficiently large text corpus is not straightforward. Nonetheless, we compiled a corpus of 92,963,671 words from 11 sources, which represents the largest corpus of Welsh. The relative complexity of Welsh punctuation made tokenisation of this corpus challenging, as punctuation could not be used for boundary detection. We considered several tokenisation methods, including one designed specifically for Welsh. To account for rich inflection, we used a method for learning word embeddings that is based on subwords and can therefore relate different surface forms more effectively during the training phase. We conducted both qualitative and quantitative evaluations of the resulting word embeddings, which outperformed previously described Welsh word embeddings produced as part of a larger study covering 157 languages. Our study was the first to focus specifically on Welsh word embeddings.
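
The subword idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it only shows how fastText-style character n-grams let an inflected or mutated form share units with its base form. The helper names `char_ngrams` and `shared_ngrams` are hypothetical, the n-gram range of 3 to 6 is an assumed setting (it matches fastText's documented defaults, not necessarily the values used in the study), and the example words (cath "cat", cathod "cats", and the soft-mutated form gath) are chosen here for illustration.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the set of character n-grams of `word`.

    The word is wrapped in '<' and '>' boundary markers, as fastText does,
    so that prefixes and suffixes are distinguished from word-internal
    character sequences. The range 3..6 is an assumed setting.
    """
    marked = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    return grams


def shared_ngrams(a, b, n_min=3, n_max=6):
    """Return the n-grams that two surface forms have in common."""
    return char_ngrams(a, n_min, n_max) & char_ngrams(b, n_min, n_max)


if __name__ == "__main__":
    # Plural inflection: the stem-initial n-grams are shared.
    print(sorted(shared_ngrams("cath", "cathod")))
    # Initial consonant mutation: the word-final n-grams are shared.
    print(sorted(shared_ngrams("cath", "gath")))
```

In a subword model, each word vector is built from the vectors of its n-grams, so forms that overlap in this way are pulled towards each other during training even when one of them is rare in the corpus.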

Item Type: Article
Date Type: Published Online
Status: Published
Schools: English, Communication and Philosophy; Computer Science & Informatics; Mathematics; Data Innovation Research Institute (DIURI)
Subjects: P Language and Literature > P Philology. Linguistics; Q Science > QA Mathematics > QA76 Computer software
Additional Information: This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Publisher: MDPI
ISSN: 2076-3417
Funders: Welsh Government
Date of First Compliant Deposit: 6 August 2021
Date of Acceptance: 21 July 2021
Last Modified: 06 Aug 2021 09:45
URI: http://orca.cardiff.ac.uk/id/eprint/142952
