Sentence selection strategies for distilling word embeddings from BERT

Wang, Yixiao, Bouraoui, Zied, Espinosa-Anke, Luis and Schockaert, Steven

2022. Sentence selection strategies for distilling word embeddings from BERT. Presented at: LREC 2022, Marseille, France, 20-25 June 2022. Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022). European Language Resources Association (ELRA), pp. 2591-2600.

PDF - Published Version
Available under License Creative Commons Attribution Non-commercial.

Official URL: http://lrec-conf.org/proceedings/lrec2022/pdf/2022...

Abstract

Many applications crucially rely on the availability of high-quality word vectors. To learn such representations, several strategies based on language models have been proposed in recent years. While effective, these methods typically rely on a large number of contextualised vectors for each word, which makes them impractical. In this paper, we investigate whether similar results can be obtained when only a few contextualised representations of each word can be used. To this end, we analyze a range of strategies for selecting the most informative sentences. Our results show that with a careful selection strategy, high-quality word vectors can be learned from as few as 5 to 10 sentences.

Item Type:	Conference or Workshop Item (Paper)
Date Type:	Publication
Status:	Published
Schools:	Professional Services > Advanced Research Computing @ Cardiff (ARCCA); Schools > Computer Science & Informatics
Additional Information:	© European Language Resources Association (ELRA), licensed under CC-BY-NC-4.0
Publisher:	European Language Resources Association (ELRA)
Funders:	EPSRC
Date of First Compliant Deposit:	25 May 2022
Date of Acceptance:	4 April 2022
Last Modified:	09 Jul 2025 09:28
URI:	https://orca.cardiff.ac.uk/id/eprint/150045
