Cardiff University | Prifysgol Caerdydd
ORCA: Online Research @ Cardiff

Efficient multilingual language model compression through vocabulary trimming

Ushio, Asahi, Zhou, Yi ORCID: https://orcid.org/0000-0001-7009-8515 and Camacho-Collados, Jose 2023. Efficient multilingual language model compression through vocabulary trimming. Presented at: The 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6 - 10 December 2023. Published in: Bouamor, Houda, Pino, Juan and Bali, Kalika eds. Findings of the Association for Computational Linguistics: EMNLP 2023. Association for Computational Linguistics, pp. 14725-14739. 10.18653/v1/2023.findings-emnlp.981

Full text: 2023.findings-emnlp.981.pdf (PDF, Published Version, 2MB). Available under a Creative Commons Attribution License.

Abstract

Multilingual language models (LMs) have become a powerful tool in NLP, especially for non-English languages. Nevertheless, the parameter count of multilingual LMs remains large because of the large embedding matrix needed to cover tokens across many languages. In contrast, monolingual LMs can be trained for a target language with a language-specific vocabulary only. In this paper, we propose vocabulary trimming (VT), a method that reduces a multilingual LM vocabulary to a target language by deleting tokens that are likely irrelevant to that language. In theory, VT can compress any existing multilingual LM to any language covered by the original model. In our experiments, we show that VT retains the original performance of the multilingual LM while being considerably smaller in size than the original model. The evaluation covers four NLP tasks (two generative and two classification tasks) across four widely used multilingual LMs in seven languages. The results show that this methodology keeps the best of both the monolingual and multilingual worlds: the trimmed models are as small as monolingual models without the need to retrain them, and VT can even help limit potentially harmful social biases.
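To make the trimming idea concrete, the sketch below shows one way the procedure described in the abstract could look in code. It is a minimal illustration only, assuming a HuggingFace Transformers model and tokenizer (xlm-roberta-base is used purely as an example) and a toy target-language corpus; it is not the authors' released implementation, and a complete pipeline would also rewrite the tokenizer's vocabulary files with the same id mapping.

```python
# Minimal vocabulary-trimming sketch (assumption: a HuggingFace-style encoder
# such as xlm-roberta-base; the paper's actual tooling may differ in detail).
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "xlm-roberta-base"            # multilingual LM with a large vocabulary
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# 1. Collect the token ids that actually occur in a target-language corpus,
#    always keeping the model's special tokens.
corpus = ["Ceci est une phrase d'exemple.", "En voici une autre."]  # toy French corpus
keep_ids = set(tokenizer.all_special_ids)
for sentence in corpus:
    keep_ids.update(tokenizer(sentence)["input_ids"])
keep_ids = sorted(keep_ids)

# 2. Copy only the kept rows of the embedding matrix into a smaller matrix.
old_embeddings = model.get_input_embeddings().weight.data
new_embeddings = torch.nn.Embedding(len(keep_ids), old_embeddings.size(1))
new_embeddings.weight.data = old_embeddings[keep_ids].clone()

# 3. Record the old-id -> new-id mapping so tokenized input can be re-indexed
#    (a real setup would also rewrite the tokenizer vocabulary with this map).
old_to_new = {old: new for new, old in enumerate(keep_ids)}

# 4. Swap the trimmed embeddings into the model.
model.set_input_embeddings(new_embeddings)
model.config.vocab_size = len(keep_ids)
print(f"kept {len(keep_ids)} of {old_embeddings.size(0)} embedding rows")
```

Because the embedding matrix dominates the parameter count of multilingual LMs, dropping the rows for unused tokens shrinks the model substantially while leaving the transformer body untouched, which is why the trimmed model can retain the original performance.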

Item Type: Conference or Workshop Item (Paper)
Date Type: Published Online
Status: Published
Schools: Computer Science & Informatics
Publisher: Association for Computational Linguistics
Date of First Compliant Deposit: 17 October 2024
Date of Acceptance: 7 October 2023
Last Modified: 17 Oct 2024 13:20
URI: https://orca.cardiff.ac.uk/id/eprint/170396
