Ushio, Asahi, Zhou, Yi ORCID: https://orcid.org/0000-0001-7009-8515 and Camacho-Collados, Jose 2023. Efficient multilingual language model compression through vocabulary trimming. Presented at: The 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6-10 December 2023. Published in: Bouamor, Houda, Pino, Juan and Bali, Kalika eds. Findings of the Association for Computational Linguistics: EMNLP 2023. Association for Computational Linguistics, pp. 14725-14739. 10.18653/v1/2023.findings-emnlp.981
PDF (Published Version, 2MB), available under a Creative Commons Attribution License.
Abstract
Multilingual language models (LMs) have become a powerful tool in NLP, especially for non-English languages. Nevertheless, multilingual LMs remain large in parameter count because their embedding matrices must cover vocabulary tokens from many languages, whereas monolingual LMs can be trained on a target language with a language-specific vocabulary only. In this paper, we propose vocabulary trimming (VT), a method that reduces the vocabulary of a multilingual LM to a target language by deleting potentially irrelevant tokens from its vocabulary. In theory, VT can compress any existing multilingual LM to any language covered by the original model. In our experiments, we show that VT retains the original performance of the multilingual LM while producing a model considerably smaller in size. The evaluation covers four NLP tasks (two generative and two classification tasks) across four widely used multilingual LMs in seven languages. The results show that this methodology keeps the best of both the monolingual and multilingual worlds: models as small as monolingual ones, without the need to specifically retrain them, and it can even help limit potentially harmful social biases.
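The trimming operation described in the abstract can be illustrated with a short sketch. This is a hypothetical, minimal illustration built on the Hugging Face `transformers` and `torch` APIs, not the authors' released implementation: `corpus` is a placeholder for a target-language text collection, and the remapping of the tokenizer itself (so it emits the new, smaller IDs) is left out for brevity.

```python
# Minimal sketch of vocabulary trimming (VT): keep only the embedding rows
# for tokens observed in a target-language corpus. Hypothetical illustration;
# `corpus` is a placeholder, not part of the paper's code.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "xlm-roberta-base"  # any multilingual LM covering the target language
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# 1. Collect the token IDs that actually occur in target-language text,
#    always retaining the special tokens (<s>, padding, etc.).
corpus = ["Ein Beispielsatz in der Zielsprache."]  # placeholder corpus
keep_ids = set(tokenizer.all_special_ids)
for text in corpus:
    keep_ids.update(tokenizer(text)["input_ids"])
keep_ids = sorted(keep_ids)

# 2. Slice the input embedding matrix down to the kept rows.
old_emb = model.get_input_embeddings().weight.data
new_emb = torch.nn.Embedding(len(keep_ids), old_emb.shape[1])
new_emb.weight.data = old_emb[keep_ids].clone()
model.set_input_embeddings(new_emb)
model.config.vocab_size = len(keep_ids)

# Mapping from old IDs to new IDs, needed when rebuilding the tokenizer so
# that it produces indices into the trimmed embedding matrix.
id_map = {old: new for new, old in enumerate(keep_ids)}
print(f"vocabulary trimmed to {len(keep_ids)} tokens")
```

Because the transformer body (attention and feed-forward layers) is untouched, per-token computation is identical to the original model; the savings come entirely from the smaller embedding matrix, which is consistent with the abstract's claim that VT requires no retraining.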
| Item Type: | Conference or Workshop Item (Paper) |
|---|---|
| Date Type: | Published Online |
| Status: | Published |
| Schools: | Computer Science & Informatics |
| Publisher: | Association for Computational Linguistics |
| Date of First Compliant Deposit: | 17 October 2024 |
| Date of Acceptance: | 7 October 2023 |
| Last Modified: | 17 Oct 2024 13:20 |
| URI: | https://orca.cardiff.ac.uk/id/eprint/170396 |