Almeman, Fatemh
2025.
Lexicography in NLP: A study on the interaction between lexical resources and Large Language Models.
PhD Thesis,
Cardiff University.
Item availability restricted.
PDF (Fatemah Almeman PhD Thesis) - Accepted Post-Print Version (4MB)
PDF (Cardiff University Electronic Publication Form) - Supplemental Material, restricted to repository staff only (119kB)
Abstract
This thesis explores the interaction between lexical resources (LRs) and large language models (LLMs) in the context of natural language processing, focusing on the evaluation of WordNet (WN), the de facto lexical database for English, along with the development of a new dataset and a novel reverse dictionary (RD) method. The investigation starts with an assessment of WN, particularly its examples, both intrinsically and extrinsically, compared to other resources using the Good Dictionary EXamples (GDEX) framework. This evaluation shows that WN’s examples are often limited in length and informativeness. In an extrinsic analysis, we examined WN’s performance in definition modeling and word similarity tasks, where informative contextual representations are essential. Results indicate that LLM-generated examples are more informative than those from WN. To overcome limitations in LRs (some uncovered by our analysis), we then introduce a new dataset called 3D-EX providing terms, definitions, and usage examples. It integrates entries from ten diverse English dictionaries and encyclopedias with varying linguistic styles. We conducted intrinsic experiments on source classification, predicting the origin of a <term, definition> instance, and RD, which retrieves a ranked list of terms from a definition. Results indicate that 3D-EX enhances performance in both tasks, highlighting its usefulness for NLP. This thesis further explores RD by introducing GEAR, a lightweight and unsupervised approach to RD tasks. GEAR operates through four stages: Generate, Embed, Average, and Rank. It was evaluated using the Hill dataset, a leading benchmark for RD tasks, and it consistently outperformed existing methods. In conclusion, this thesis investigates how LLMs and LRs can benefit each other. We identified limitations in some resources and found that LLMs are a suitable tool for addressing them. Additionally, LLMs can automatically improve language resources by unifying them with different anchors. Datasets and code are publicly available.
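The four GEAR stage names suggest a simple pipeline, and the sketch below shows one possible reading of it. It is an illustration under stated assumptions, not the thesis's implementation: the abstract does not say which LLM, embedding model, or similarity measure is used, so `generate_candidates` is a hypothetical stub with a canned answer, `TOY_EMBEDDINGS` holds made-up 3-dimensional vectors, and cosine similarity is assumed for the ranking step.

```python
import math

# Made-up 3-d vectors standing in for a real embedding model (assumption).
TOY_EMBEDDINGS = {
    "dog": [0.9, 0.1, 0.0],
    "puppy": [0.8, 0.2, 0.0],
    "hound": [0.85, 0.05, 0.1],
    "car": [0.0, 0.9, 0.1],
    "truck": [0.1, 0.8, 0.2],
    "banana": [0.0, 0.1, 0.9],
}


def generate_candidates(definition):
    """Generate: hypothetical stub for an LLM that proposes terms for a definition."""
    canned = {"a domesticated animal that barks": ["dog", "puppy", "hound"]}
    return canned.get(definition, [])


def embed(term):
    """Embed: look up the toy vector for a term."""
    return TOY_EMBEDDINGS[term]


def average(vectors):
    """Average: component-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity, returning 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def gear_rank(definition, vocabulary, top_k=3):
    """Rank: order vocabulary terms by similarity to the averaged candidate vector."""
    candidates = [c for c in generate_candidates(definition) if c in TOY_EMBEDDINGS]
    if not candidates:
        return []  # nothing generated, so nothing to rank against
    query = average([embed(c) for c in candidates])
    ranked = sorted(vocabulary, key=lambda t: cosine(query, embed(t)), reverse=True)
    return ranked[:top_k]


ranked = gear_rank("a domesticated animal that barks", list(TOY_EMBEDDINGS))
print(ranked)
```

In this reading no training step is involved: the only learned components are the generator and the embedding model, both used off the shelf, which is consistent with the abstract's description of the approach as lightweight and unsupervised.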
| Item Type: | Thesis (PhD) |
|---|---|
| Date Type: | Completion |
| Status: | Unpublished |
| Schools: | Schools > Computer Science & Informatics |
| Subjects: | Q Science > QA Mathematics > QA75 Electronic computers. Computer science; Q Science > QA Mathematics > QA76 Computer software |
| Funders: | Saudi Government Scholarship |
| Date of First Compliant Deposit: | 7 November 2025 |
| Date of Acceptance: | 31 October 2025 |
| Last Modified: | 07 Nov 2025 18:00 |
| URI: | https://orca.cardiff.ac.uk/id/eprint/182226 |