Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

Distilling word vectors from contextualised language models

Wang, Yixiao 2023. Distilling word vectors from contextualised language models. PhD Thesis, Cardiff University.
Item availability restricted.

PDF - Accepted Post-Print Version (2MB)
Available under License Creative Commons Attribution Non-commercial No Derivatives.

PDF (Cardiff University Electronic Publication Form) - Supplemental Material (772kB)
Restricted to Repository staff only

Abstract

Although contextualised language models (CLMs) have reduced the need for static word embeddings in many NLP tasks, static representations of word meaning remain crucial in tasks where words have to be encoded without context. Such tasks arise in domains such as information retrieval. Compared to learning static word embeddings from scratch, distilling such representations from CLMs has advantages in downstream tasks [68], [2]. Usually, the embedding of a word w is distilled by feeding random sentences that mention w to a CLM and extracting the corresponding contextualised vectors. In this thesis, we hypothesise that distilling word embeddings from CLMs can be improved by feeding more informative mentions to the CLM. As a first contribution, we therefore propose a sentence selection strategy based on a topic model. Since distilling high-quality word embeddings from CLMs normally requires many mentions of each word, we also investigate whether competitive word embeddings can be obtained from a small number of carefully selected mentions. As our second contribution, we explore a range of sentence selection strategies and evaluate the resulting word embeddings on a variety of tasks. We find that 20 informative sentences per word are sufficient to obtain competitive word embeddings, especially when the sentences are selected by our proposed strategies. Beyond improving sentence selection, as our third contribution, we study further strategies for obtaining word embeddings. We find that SBERT embeddings capture an aspect of word meaning that is highly complementary to the mention embeddings we previously focused on, and we therefore propose combining the vectors produced by these two methods using a contrastive learning model. The results confirm that combining these vectors leads to more informative word embeddings. In conclusion, this thesis shows that better static word embeddings can be efficiently distilled from CLMs by strategically selecting sentences and by combining complementary methods.
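To make the distillation step described above concrete, the following is a minimal sketch of the common mention-averaging approach, assuming a Hugging Face BERT model. The model choice, the use of the final hidden layer, and the example sentences are illustrative assumptions, not the thesis's exact setup.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def distill_word_vector(word, sentences):
    """Average the contextualised vectors of `word` over its mentions."""
    vectors = []
    word_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    for sentence in sentences:
        enc = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]  # (seq_len, dim)
        # Locate the subword span of `word` and average its hidden states.
        ids = enc["input_ids"][0].tolist()
        for i in range(len(ids) - len(word_ids) + 1):
            if ids[i:i + len(word_ids)] == word_ids:
                vectors.append(hidden[i:i + len(word_ids)].mean(dim=0))
                break
    # One static vector for `word`, averaged over all found mentions.
    return torch.stack(vectors).mean(dim=0)

vec = distill_word_vector("bank", [
    "She sat on the bank of the river.",
    "The bank approved the loan.",
])

Under the sentence selection strategies proposed in the thesis, the sentences argument would hold roughly 20 informative mentions per word rather than randomly sampled ones.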

Item Type: Thesis (PhD)
Date Type: Completion
Status: Unpublished
Schools: Computer Science & Informatics
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Funders: School of Computer Science and Informatics (2¼-year stipend)
Date of First Compliant Deposit: 25 October 2023
Last Modified: 25 Oct 2023 09:18
URI: https://orca.cardiff.ac.uk/id/eprint/163139
