Wang, Yixiao, Bouraoui, Zied, Espinosa-Anke, Luis (ORCID: https://orcid.org/0000-0001-6830-9176) and Schockaert, Steven (ORCID: https://orcid.org/0000-0002-9256-2881). 2022. Sentence selection strategies for distilling word embeddings from BERT. In: Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), Marseille, France, 20-25 June 2022. European Language Resources Association (ELRA), pp. 2591-2600.
PDF (Published Version, 270kB), available under a Creative Commons Attribution Non-Commercial license.
Abstract
Many applications crucially rely on the availability of high-quality word vectors. To learn such representations, several strategies based on language models have been proposed in recent years. While effective, these methods typically rely on a large number of contextualised vectors for each word, which makes them impractical. In this paper, we investigate whether similar results can be obtained when only a few contextualised representations of each word can be used. To this end, we analyze a range of strategies for selecting the most informative sentences. Our results show that with a careful selection strategy, high-quality word vectors can be learned from as few as 5 to 10 sentences.
| Item Type: | Conference or Workshop Item (Paper) |
|---|---|
| Date Type: | Publication |
| Status: | Published |
| Schools: | Professional Services > Advanced Research Computing @ Cardiff (ARCCA); Schools > Computer Science & Informatics |
| Additional Information: | © European Language Resources Association (ELRA), licensed under CC-BY-NC-4.0 |
| Publisher: | European Language Resources Association (ELRA) |
| Funders: | EPSRC |
| Date of First Compliant Deposit: | 25 May 2022 |
| Date of Acceptance: | 4 April 2022 |
| Last Modified: | 09 Jul 2025 09:28 |
| URI: | https://orca.cardiff.ac.uk/id/eprint/150045 |