Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

English–Welsh cross-lingual embeddings

Espinosa-Anke, Luis ORCID: https://orcid.org/0000-0001-6830-9176, Palmer, Geraint ORCID: https://orcid.org/0000-0001-7865-6964, Filimonov, Maxim, Corcoran, Padraig ORCID: https://orcid.org/0000-0001-9731-3385, Spasic, Irena ORCID: https://orcid.org/0000-0002-8132-3885 and Knight, Dawn ORCID: https://orcid.org/0000-0002-4745-6502 2021. English–Welsh cross-lingual embeddings. Applied Sciences 11 (14), 6541. 10.3390/app11146541

PDF - Published Version
Available under License Creative Commons Attribution.


Abstract

Cross-lingual embeddings are vector space representations in which word translations tend to be co-located. These representations enable transfer learning across languages, thus bridging the gap between data-rich languages such as English and others. In this paper, we present and evaluate a suite of cross-lingual embeddings for the English–Welsh language pair. To train the bilingual embeddings, a Welsh corpus of approximately 145 M words was combined with an English Wikipedia corpus. We used a bilingual dictionary to frame the problem of learning bilingual mappings as a supervised machine learning task: a word vector space is first learned independently on each monolingual corpus, after which a linear alignment strategy is applied to map the monolingual embeddings to a common bilingual vector space. Two approaches were used to learn monolingual embeddings, namely word2vec and fastText. Three cross-language alignment strategies were explored, namely cosine similarity, inverted softmax and cross-domain similarity local scaling (CSLS). We evaluated different combinations of these approaches using two tasks: bilingual dictionary induction and cross-lingual sentiment analysis. The best results were achieved using monolingual fastText embeddings and the CSLS metric. We also demonstrated that by including a few automatically translated training documents, the performance of a cross-lingual text classifier for Welsh can increase by approximately 20 percentage points.
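
The pipeline described in the abstract (independent monolingual spaces, a linear map learned from a bilingual dictionary, then nearest-neighbour retrieval with CSLS) can be illustrated with a minimal sketch. The abstract does not say which linear alignment method was used, so the orthogonal Procrustes solution below is an assumption, and the random vectors merely stand in for real word2vec or fastText embeddings. The function names procrustes, csls_scores and demo are hypothetical and are not taken from the paper or its code.

```python
import numpy as np


def normalize(X):
    """Scale every row vector to unit length."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)


def procrustes(X_src, Y_tgt):
    """Orthogonal W minimising ||X_src @ W - Y_tgt||_F (assumed alignment method).

    Rows of X_src and Y_tgt are the embeddings of dictionary word pairs.
    """
    U, _, Vt = np.linalg.svd(X_src.T @ Y_tgt)
    return U @ Vt


def csls_scores(src, tgt, k=10):
    """CSLS(x, y) = 2*cos(x, y) - r_T(x) - r_S(y).

    r_T(x) is the mean cosine similarity of x to its k nearest target
    vectors; r_S(y) is the mean similarity of y to its k nearest source
    vectors. Subtracting them penalises "hub" vectors that are close to
    everything.
    """
    cos = normalize(src) @ normalize(tgt).T          # shape (n_src, n_tgt)
    k_t = min(k, cos.shape[1])
    k_s = min(k, cos.shape[0])
    r_src = np.sort(cos, axis=1)[:, -k_t:].mean(axis=1)
    r_tgt = np.sort(cos, axis=0)[-k_s:, :].mean(axis=0)
    return 2 * cos - r_src[:, None] - r_tgt[None, :]


def demo(n_words=200, dim=50, n_dict=100, seed=0):
    """Toy bilingual dictionary induction on synthetic data.

    The "Welsh" space is an exact rotation of the "English" space, which is
    far cleaner than real embeddings; the point is only to show the steps.
    """
    rng = np.random.default_rng(seed)
    english = rng.normal(size=(n_words, dim))
    rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
    welsh = english @ rotation

    # Learn the map from the first n_dict pairs (the bilingual dictionary).
    W = procrustes(english[:n_dict], welsh[:n_dict])

    # Translate the held-out words by CSLS nearest-neighbour retrieval.
    scores = csls_scores(english[n_dict:] @ W, welsh)
    predicted = scores.argmax(axis=1)
    gold = np.arange(n_dict, n_words)
    return float(np.mean(predicted == gold))


if __name__ == "__main__":
    print("held-out induction accuracy:", demo())
```

Constraining the map to be orthogonal preserves distances within the source space, which is one common motivation for the Procrustes formulation; an unconstrained least-squares map would be another reasonable reading of "linear alignment".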

Item Type: Article
Date Type: Publication
Status: Published
Schools: English, Communication and Philosophy; Computer Science & Informatics; Mathematics; Data Innovation Research Institute (DIURI)
Subjects: Q Science > QA Mathematics > QA76 Computer software
Additional Information: This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Publisher: MDPI
ISSN: 2076-3417
Funders: Welsh Government
Date of First Compliant Deposit: 19 July 2021
Date of Acceptance: 5 July 2021
Last Modified: 14 May 2023 16:38
URI: https://orca.cardiff.ac.uk/id/eprint/142688

