Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

Sentence selection strategies for distilling word embeddings from BERT

Wang, Yixiao, Bouraoui, Zied, Espinosa-Anke, Luis and Schockaert, Steven 2022. Sentence selection strategies for distilling word embeddings from BERT. Presented at: LREC 2022, 20-25 June 2022. Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022). European Language Resources Association (ELRA), pp. 2591-2600.

PDF - Published Version (270kB)
Available under License Creative Commons Attribution Non-commercial.


Many applications crucially rely on the availability of high-quality word vectors. To learn such representations, several strategies based on language models have been proposed in recent years. While effective, these methods typically rely on a large number of contextualised vectors for each word, which makes them impractical. In this paper, we investigate whether similar results can be obtained when only a few contextualised representations of each word can be used. To this end, we analyze a range of strategies for selecting the most informative sentences. Our results show that with a careful selection strategy, high-quality word vectors can be learned from as few as 5 to 10 sentences.
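The core idea of the abstract, building a static word vector from only a handful of contextualised vectors, can be sketched in a few lines. The `distill_word_vector` helper below and its centroid-based selection rule are hypothetical illustrations of *one* possible selection strategy, assumed for this sketch; they are not necessarily among the strategies the paper evaluates, and the contextualised vectors (which would come from BERT in practice) are taken as given input.

```python
import numpy as np

def distill_word_vector(context_vecs, k=10):
    """Distill a static word vector from contextualised vectors.

    Hypothetical selection strategy for illustration only: keep the k
    contextualised vectors most similar to the centroid of all of them
    (i.e. the most "typical" sentence contexts), then average those k.
    `context_vecs` has shape (num_sentences, dim), one row per
    occurrence of the target word in a candidate sentence.
    """
    context_vecs = np.asarray(context_vecs, dtype=float)
    centroid = context_vecs.mean(axis=0)
    # Cosine similarity of each contextualised vector to the centroid.
    norms = np.linalg.norm(context_vecs, axis=1) * np.linalg.norm(centroid)
    sims = context_vecs @ centroid / norms
    # Select the k most centroid-like contexts and average them.
    top_k = np.argsort(sims)[-k:]
    return context_vecs[top_k].mean(axis=0)

# Usage with dummy data standing in for BERT outputs (768-dimensional):
rng = np.random.default_rng(0)
vecs = rng.normal(size=(100, 768))   # 100 candidate sentence contexts
word_vec = distill_word_vector(vecs, k=10)
```

With `k` set to the total number of sentences this reduces to a plain average over all contexts; the point of a selection strategy is that a well-chosen small `k` (the abstract reports 5 to 10 sentences) can approach that quality at a fraction of the cost.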

Item Type: Conference or Workshop Item (Paper)
Status: In Press
Schools: Computer Science & Informatics
Additional Information: © European Language Resources Association (ELRA), licensed under CC-BY-NC-4.0
Publisher: European Language Resources Association (ELRA)
Funders: EPSRC
Date of First Compliant Deposit: 25 May 2022
Date of Acceptance: 4 April 2022
Last Modified: 10 Nov 2022 11:19
