Wang, Yixiao; Bouraoui, Zied; Espinosa-Anke, Luis (ORCID: https://orcid.org/0000-0001-6830-9176); and Schockaert, Steven (ORCID: https://orcid.org/0000-0002-9256-2881). 2022. Sentence selection strategies for distilling word embeddings from BERT. Presented at: LREC 2022, 20-25 June 2022. Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022). European Language Resources Association (ELRA), pp. 2591-2600.
Abstract
Many applications crucially rely on the availability of high-quality word vectors. To learn such representations, several strategies based on language models have been proposed in recent years. While effective, these methods typically rely on a large number of contextualised vectors for each word, which makes them impractical. In this paper, we investigate whether similar results can be obtained when only a few contextualised representations of each word can be used. To this end, we analyze a range of strategies for selecting the most informative sentences. Our results show that with a careful selection strategy, high-quality word vectors can be learned from as few as 5 to 10 sentences.
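The core operation the abstract describes, pooling a small number of contextualised vectors for a word into one static embedding, can be illustrated with a minimal sketch. This is a generic mean-pooling baseline, not necessarily the paper's exact distillation method, and the function name and mock vectors are hypothetical stand-ins for real BERT hidden states.

```python
import numpy as np

def distill_word_vector(contextual_vectors):
    """Pool a handful of contextualised vectors into one static word vector.

    contextual_vectors: array of shape (n_sentences, dim), one BERT-style
    vector per selected sentence mentioning the target word.
    (Mean pooling here is an illustrative baseline, not the paper's method.)
    """
    vecs = np.asarray(contextual_vectors, dtype=float)
    # L2-normalise each contextual vector so no single sentence dominates
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    # The distilled static embedding is the mean of the normalised vectors
    return vecs.mean(axis=0)

# Toy example: 5 mock 4-dimensional contextual vectors for one word,
# standing in for vectors from 5 carefully selected sentences
rng = np.random.default_rng(0)
mock_contexts = rng.normal(size=(5, 4))
vec = distill_word_vector(mock_contexts)
print(vec.shape)  # (4,)
```

In practice the contextual vectors would come from a pretrained model's hidden states at the target word's token positions; the point of the paper is that selecting which 5 to 10 sentences feed this pooling step matters for the quality of the result.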
| Item Type: | Conference or Workshop Item (Paper) |
|---|---|
| Status: | In Press |
| Schools: | Advanced Research Computing @ Cardiff (ARCCA); Computer Science & Informatics |
| Additional Information: | © European Language Resources Association (ELRA), licensed under CC-BY-NC-4.0 |
| Publisher: | European Language Resources Association (ELRA) |
| Funders: | EPSRC |
| Date of First Compliant Deposit: | 25 May 2022 |
| Date of Acceptance: | 4 April 2022 |
| Last Modified: | 14 Jun 2024 15:16 |
| URI: | https://orca.cardiff.ac.uk/id/eprint/150045 |