Lexicography in NLP: A study on the interaction between lexical resources and Large Language Models

Almeman, Fatemh 2025. Lexicography in NLP: A study on the interaction between lexical resources and Large Language Models. PhD Thesis, Cardiff University.

Item availability restricted.

PDF (Fatemah Almeman PhD Thesis) - Accepted Post-Print Version (4MB)
PDF (Cardiff University Electronic Publication Form) - Supplemental Material, restricted to repository staff only (119kB)

Abstract

This thesis explores the interaction between lexical resources (LRs) and large language models (LLMs) in natural language processing. It focuses on the evaluation of WordNet (WN), the de facto lexical database for English, along with the development of a new dataset and a novel reverse dictionary (RD) method. The investigation starts with an assessment of WN, particularly its examples, both intrinsically and extrinsically, compared to other resources using the Good Dictionary EXamples (GDEX) framework. This evaluation shows that WN’s examples are often limited in length and informativeness. In the extrinsic analysis, we examine WN’s performance in definition modeling and word similarity tasks, where informative contextual representations are essential. Results indicate that LLM-generated examples are more informative than those from WN. To overcome limitations in LRs (some uncovered by our analysis), we then introduce a new dataset called 3D-EX, which provides terms, definitions, and usage examples. It integrates entries from ten diverse English dictionaries and encyclopedias with varying linguistic styles. We conduct intrinsic experiments on source classification, which predicts the origin of a <term, definition> instance, and on RD, which retrieves a ranked list of terms from a definition. Results indicate that 3D-EX enhances performance in both tasks, highlighting its usefulness for NLP. The thesis further explores RD by introducing GEAR, a lightweight and unsupervised approach to RD tasks. GEAR operates through four stages: Generate, Embed, Average, and Rank. It was evaluated on the Hill dataset, a leading benchmark for RD tasks, and it consistently outperformed existing methods. In conclusion, this thesis investigates how LLMs and LRs can benefit each other. We identify limitations in some resources and find that LLMs are a suitable tool for addressing them. Additionally, LLMs can automatically improve language resources by unifying them with different anchors. Datasets and code are publicly available.
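
The four GEAR stages named in the abstract (Generate, Embed, Average, Rank) can be illustrated with a minimal sketch. The abstract does not specify how each stage is implemented, so everything below is an assumption made for illustration: the Generate stage is replaced by a hard-coded lookup standing in for an LLM, the embeddings are a hypothetical three-dimensional toy table, and ranking uses cosine similarity. It is not the thesis's implementation.

```python
import math

# Hypothetical toy embedding table; a real system would use a pretrained encoder.
TOY_EMBEDDINGS = {
    "dog": [0.9, 0.1, 0.0],
    "puppy": [0.8, 0.2, 0.0],
    "hound": [0.85, 0.05, 0.1],
    "cat": [0.1, 0.9, 0.0],
    "car": [0.0, 0.1, 0.9],
}


def generate(definition):
    """Generate: stand-in for an LLM proposing candidate terms (assumption)."""
    canned = {"a domesticated animal that barks": ["puppy", "hound"]}
    return canned.get(definition, [])


def embed(term):
    """Embed: look up the vector of a term in the toy table."""
    return TOY_EMBEDDINGS[term]


def average(vectors):
    """Average: component-wise mean of the candidate vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank(query, vocabulary, top_k=3):
    """Rank: order vocabulary terms by similarity to the averaged vector."""
    ordered = sorted(vocabulary, key=lambda t: cosine(query, embed(t)), reverse=True)
    return ordered[:top_k]


def gear(definition, vocabulary, top_k=3):
    """Chain the four stages; returns an empty list if nothing was generated."""
    candidates = generate(definition)
    if not candidates:
        return []
    query = average([embed(term) for term in candidates])
    return rank(query, vocabulary, top_k)


result = gear("a domesticated animal that barks", list(TOY_EMBEDDINGS))
print(result)
```

In this toy setup the averaged vector can rank a vocabulary term that was never generated ("dog") above the generated candidates themselves, which is one plausible reason for averaging before ranking rather than returning the generated terms directly.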

Item Type:	Thesis (PhD)
Date Type:	Completion
Status:	Unpublished
Schools:	Schools > Computer Science & Informatics
Subjects:	Q Science > QA Mathematics > QA75 Electronic computers. Computer science; Q Science > QA Mathematics > QA76 Computer software
Funders:	Saudi Government Scholarship
Date of First Compliant Deposit:	7 November 2025
Date of Acceptance:	31 October 2025
Last Modified:	07 Nov 2025 18:00
URI:	https://orca.cardiff.ac.uk/id/eprint/182226
