Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

Solving cosine similarity underestimation between high frequency words by ℓ2 norm discounting

Wannasuphoprasit, Saeth, Zhou, Yi ORCID: https://orcid.org/0000-0001-7009-8515 and Bollegala, Danushka 2023. Solving cosine similarity underestimation between high frequency words by ℓ2 norm discounting. Presented at: 61st Annual Meeting of the Association for Computational Linguistics, Toronto, Canada, 9 - 14 July 2023. Published in: Rogers, Anna, Boyd-Graber, Jordan and Okazaki, Naoaki eds. Findings of the Association for Computational Linguistics: ACL 2023. Association for Computational Linguistics, pp. 8644-8652. 10.18653/v1/2023.findings-acl.550

Full text not available from this repository.

Abstract

Cosine similarity between two words, computed using their contextualised token embeddings obtained from masked language models (MLMs) such as BERT, has been shown to underestimate the actual similarity between those words CITATION. This similarity underestimation problem is particularly severe for high-frequency words. Although this problem has been noted in prior work, no solution has been proposed thus far. We observe that the ℓ2 norm of the contextualised embeddings of a word correlates with its log-frequency in the pretraining corpus. Consequently, the larger ℓ2 norms associated with high-frequency words reduce the cosine similarity values measured between them, thus underestimating the similarity scores. To solve this issue, we propose a method to discount the ℓ2 norm of a contextualised word embedding by the frequency of that word in a corpus when measuring the cosine similarities between words. We show that the so-called stop words behave differently from the rest of the words and require special consideration during the discounting process. Experimental results on a contextualised word similarity dataset show that our proposed discounting method accurately solves the similarity underestimation problem. An anonymized version of the source code of our proposed method is submitted to the reviewing system.
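
To make the idea in the abstract concrete, the following is a minimal sketch of frequency-discounted cosine similarity. It is not the authors' implementation: the abstract does not give the discounting function, so `discount`, its parameter `beta`, and the use of raw corpus counts are all hypothetical choices made only for illustration. The sketch assumes the embeddings are plain lists of floats and that a corpus count is available for each word.

```python
import math


def l2_norm(vec):
    """Euclidean (l2) norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def cosine(x, y):
    """Standard cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (l2_norm(x) * l2_norm(y))


def discount(count, beta=0.01):
    """Hypothetical discount factor in (0, 1].

    Assumption for illustration only: the factor shrinks with the
    log-frequency of the word, mirroring the observation that the l2 norm
    correlates with log-frequency. beta = 0 disables discounting.
    """
    return 1.0 / (1.0 + beta * math.log1p(count))


def discounted_cosine(x, y, count_x, count_y, beta=0.01):
    """Cosine similarity with each l2 norm scaled by its discount factor.

    Shrinking the norms of frequent words shrinks the denominator, which
    raises the similarity score for those words.
    """
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = discount(count_x, beta) * l2_norm(x)
    norm_y = discount(count_y, beta) * l2_norm(y)
    return dot / (norm_x * norm_y)
```

For example, `discounted_cosine(x, y, 1000, 1000)` returns a larger score than `cosine(x, y)` for the same pair of vectors with a positive dot product, while `beta=0.0` recovers the standard cosine. Because the denominator is rescaled, the discounted score is no longer bounded by 1 under this sketch. The special handling of stop words mentioned in the abstract is not modelled here.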

Item Type: Conference or Workshop Item (Paper)
Date Type: Published Online
Status: Published
Schools: Computer Science & Informatics
Publisher: Association for Computational Linguistics
Last Modified: 05 Aug 2024 15:17
URI: https://orca.cardiff.ac.uk/id/eprint/170397
