Espinosa-Anke, Luis ORCID: https://orcid.org/0000-0001-6830-9176, Palmer, Geraint ORCID: https://orcid.org/0000-0001-7865-6964, Filimonov, Maxim, Corcoran, Padraig ORCID: https://orcid.org/0000-0001-9731-3385, Spasic, Irena ORCID: https://orcid.org/0000-0002-8132-3885 and Knight, Dawn ORCID: https://orcid.org/0000-0002-4745-6502 2021. English–Welsh cross-lingual embeddings. Applied Sciences 11 (14), 6541. 10.3390/app11146541
Abstract
Cross-lingual embeddings are vector space representations in which word translations tend to be co-located. These representations enable transfer learning across languages, thus bridging the gap between data-rich languages such as English and others. In this paper, we present and evaluate a suite of cross-lingual embeddings for the English–Welsh language pair. To train the bilingual embeddings, a Welsh corpus of approximately 145 million words was combined with an English Wikipedia corpus. We used a bilingual dictionary to frame the problem of learning bilingual mappings as a supervised machine learning task: a word vector space is first learned independently on each monolingual corpus, after which a linear alignment strategy is applied to map the monolingual embeddings to a common bilingual vector space. Two approaches were used to learn monolingual embeddings, namely word2vec and fastText. Three cross-language alignment strategies were explored, namely cosine similarity, inverted softmax and cross-domain similarity local scaling (CSLS). We evaluated different combinations of these approaches using two tasks: bilingual dictionary induction and cross-lingual sentiment analysis. The best results were achieved using monolingual fastText embeddings and the CSLS metric. We also demonstrated that, by including a few automatically translated training documents, the performance of a cross-lingual text classifier for Welsh can increase by approximately 20 percentage points.
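The alignment pipeline summarised in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract only says that a linear alignment is learned from a bilingual dictionary, so the orthogonal Procrustes solution used here is an assumption, as are the function names, the toy dimensionality and the synthetic vectors (the Welsh space is simply a rotated copy of the English one). CSLS is computed as twice the cosine similarity minus the mean similarity of each word to its k nearest neighbours in the other language.

```python
import numpy as np

def normalize(m):
    """Scale each row to unit length so dot products are cosine similarities."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)

def learn_orthogonal_map(src, tgt):
    """Orthogonal Procrustes: the orthogonal W minimising ||src @ W - tgt||_F.

    Assumption: the paper's linear alignment may use a different solver.
    """
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt

def csls_scores(mapped_src, tgt, k=2):
    """CSLS(x, y) = 2*cos(x, y) - r_tgt(x) - r_src(y)."""
    sims = normalize(mapped_src) @ normalize(tgt).T
    # Mean similarity of each source word to its k nearest target words.
    r_src = np.sort(sims, axis=1)[:, -k:].mean(axis=1, keepdims=True)
    # Mean similarity of each target word to its k nearest source words.
    r_tgt = np.sort(sims, axis=0)[-k:, :].mean(axis=0, keepdims=True)
    return 2 * sims - r_src - r_tgt

# Toy bilingual dictionary: row i of each matrix is one translation pair.
en_words = ["dog", "cat", "house", "water", "milk"]
cy_words = ["ci", "cath", "tŷ", "dŵr", "llaeth"]

rng = np.random.default_rng(0)
english = rng.standard_normal((5, 4))                    # synthetic "English" vectors
rotation, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # hidden orthogonal map
welsh = english @ rotation                               # synthetic "Welsh" vectors

W = learn_orthogonal_map(english, welsh)
scores = csls_scores(english @ W, welsh, k=2)
predicted = scores.argmax(axis=1)  # index of the best Welsh candidate per English word

for i, j in enumerate(predicted):
    print(en_words[i], "->", cy_words[j])
```

With real embeddings the dictionary would be split into training and test pairs, and bilingual dictionary induction would be scored by how often the held-out translation is ranked first.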
Item Type: | Article |
---|---|
Date Type: | Publication |
Status: | Published |
Schools: | English, Communication and Philosophy; Computer Science & Informatics; Mathematics; Data Innovation Research Institute (DIURI) |
Subjects: | Q Science > QA Mathematics > QA76 Computer software |
Additional Information: | This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited |
Publisher: | MDPI |
ISSN: | 2076-3417 |
Funders: | Welsh Government |
Date of First Compliant Deposit: | 19 July 2021 |
Date of Acceptance: | 5 July 2021 |
Last Modified: | 14 May 2023 16:38 |
URI: | https://orca.cardiff.ac.uk/id/eprint/142688 |
Citation Data
Cited 4 times in Scopus.