Chopard, Daphne
2023.
Deep learning for clinical texts in low-data regimes.
PhD Thesis,
Cardiff University.
Item availability restricted.
PDF (Daphne Chopard PhD thesis)
- Accepted Post-Print Version
Available under License Creative Commons Attribution No Derivatives (4MB).
PDF (Daphne Chopard ORCA Form)
- Supplemental Material
Restricted to Repository staff only (403kB).
Abstract
Electronic health records contain a wealth of valuable information for improving healthcare. There are, however, challenges associated with clinical text that prevent computers from maximising the utility of such information. While deep learning (DL) has emerged as a practical paradigm for dealing with the complexities of natural language, applying this class of machine learning algorithms to clinical text raises several research questions. First, we tackled the problem of data sparsity by looking into the task of adverse event detection. As these events are rare, examples thereof are lacking. To compensate for data scarcity, we leveraged large pre-trained language models (LMs) in combination with formally represented medical knowledge. We demonstrated that such a combination exhibits remarkable generalisation abilities despite the low availability of data. Second, we focused on the omnipresence of short forms in clinical texts. This typically leads to out-of-vocabulary problems, which motivates unlocking the underlying words. The novelty of our approach lies in its capacity to learn how to automatically expand short forms without resorting to external resources. Third, we investigated data augmentation to address the issue of data scarcity at its core. To the best of our knowledge, we were among the first to investigate population-based augmentation for scheduling text data augmentation. Interestingly, little improvement was seen in fine-tuning large pre-trained LMs with the augmented data. We suggest that, as LMs proved able to cope well with small datasets, data augmentation became redundant. We conclude that DL approaches to clinical text mining should be developed by fine-tuning large LMs. One area where such models may struggle is the use of clinical short forms. Our method for automating their expansion addresses this issue. Together, these two approaches provide a blueprint for successfully developing DL approaches to clinical text mining in low-data regimes.
Item Type: | Thesis (PhD) |
---|---|
Date Type: | Completion |
Status: | Unpublished |
Schools: | Computer Science & Informatics |
Subjects: | Q Science > QA Mathematics > QA76 Computer software |
Date of First Compliant Deposit: | 15 March 2023 |
Date of Acceptance: | 13 March 2023 |
Last Modified: | 16 Mar 2023 09:22 |
URI: | https://orca.cardiff.ac.uk/id/eprint/157748 |