Miok, Kristian, Corcoran, Padraig ORCID: https://orcid.org/0000-0001-9731-3385 and Spasic, Irena ORCID: https://orcid.org/0000-0002-8132-3885 2023. The value of numbers in clinical text classification. Machine Learning and Knowledge Extraction 5 (3), pp. 746-762. 10.3390/make5030040
PDF - Published Version
Available under License Creative Commons Attribution.
Abstract
Clinical text often includes numbers of various types and formats. However, most current text classification approaches do not take advantage of these numbers. This study aims to demonstrate that using numbers as features can significantly improve the performance of text classification models. This study also demonstrates the feasibility of extracting such features from clinical text. Unsupervised learning was used to identify patterns of number usage in clinical text. These patterns were analyzed manually and converted into pattern-matching rules. Information extraction was used to incorporate numbers as features into a document representation model. We evaluated text classification models trained on such a representation. Our experiments were performed with two document representation models (vector space model and word embedding model) and two classification models (support vector machines and neural networks). The results showed that even a handful of numerical features can significantly improve text classification performance. We conclude that commonly used document representations do not represent numbers in a way that lets machine learning algorithms utilize them effectively as features. Although we demonstrated that traditional information extraction can be effective in converting numbers into features, further community-wide research is required to systematically incorporate number representation into the word embedding process.
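The general idea described in the abstract (pattern-matching rules that pull numbers out of text and add them to a document representation) can be sketched as follows. This is only an illustration under stated assumptions: the two rules in `RULES`, the feature names, the function names, and the presence-flag encoding are all hypothetical and are not taken from the paper, whose actual rules were derived from patterns found by unsupervised learning.

```python
import re
from collections import Counter

# Hypothetical pattern-matching rules (NOT the rules from the paper).
# Each rule captures one number in group 1.
RULES = {
    "age_years": re.compile(
        r"\b(\d{1,3})\s*-?\s*(?:years?|yrs?)\s*-?\s*old\b", re.IGNORECASE
    ),
    "weight_kg": re.compile(r"\b(\d+(?:\.\d+)?)\s*kg\b", re.IGNORECASE),
}


def extract_numeric_features(text):
    """Apply each rule; return the first matched number or None per feature."""
    features = {}
    for name, pattern in RULES.items():
        match = pattern.search(text)
        features[name] = float(match.group(1)) if match else None
    return features


def bag_of_words(text, vocabulary):
    """Simple term-count vector over a fixed vocabulary (vector space model)."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [float(counts[word]) for word in vocabulary]


def represent(text, vocabulary):
    """Concatenate term counts with (value, presence flag) per numeric feature."""
    vector = bag_of_words(text, vocabulary)
    numeric = extract_numeric_features(text)
    for name in RULES:
        value = numeric[name]
        # Assumed encoding: a missing value becomes 0.0 plus a 0.0 presence flag.
        vector.append(0.0 if value is None else value)
        vector.append(0.0 if value is None else 1.0)
    return vector


if __name__ == "__main__":
    note = "A 67-year-old man, weight 82.5 kg, presented with knee pain."
    print(extract_numeric_features(note))
    print(represent(note, ["knee", "pain", "fever"]))
```

The resulting vector could then be passed to any classifier, such as a support vector machine. The presence flag is one possible way to distinguish a missing number from a true zero, and in practice the numeric values would likely need scaling before being combined with term counts.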
Item Type: | Article |
---|---|
Date Type: | Publication |
Status: | Published |
Schools: | Computer Science & Informatics |
Subjects: | Q Science > QA Mathematics > QA76 Computer software |
Publisher: | MDPI |
ISSN: | 2504-4990 |
Date of First Compliant Deposit: | 24 July 2023 |
Date of Acceptance: | 5 July 2023 |
Last Modified: | 24 Jul 2023 18:55 |
URI: | https://orca.cardiff.ac.uk/id/eprint/160867 |