Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

The value of numbers in clinical text classification

Miok, Kristian, Corcoran, Padraig ORCID: https://orcid.org/0000-0001-9731-3385 and Spasic, Irena ORCID: https://orcid.org/0000-0002-8132-3885 2023. The value of numbers in clinical text classification. Machine Learning and Knowledge Extraction 5 (3), pp. 746-762. 10.3390/make5030040

PDF - Published Version
Available under License Creative Commons Attribution.


Abstract

Clinical text often includes numbers of various types and formats. However, most current text classification approaches do not take advantage of these numbers. This study aims to demonstrate that using numbers as features can significantly improve the performance of text classification models. It also demonstrates the feasibility of extracting such features from clinical text. Unsupervised learning was used to identify patterns of number usage in clinical text. These patterns were analyzed manually and converted into pattern-matching rules. Information extraction was then used to incorporate numbers as features into a document representation model. We evaluated text classification models trained on such representations. Our experiments were performed with two document representation models (vector space model and word embedding model) and two classification models (support vector machines and neural networks). The results showed that even a handful of numerical features can significantly improve text classification performance. We conclude that commonly used document representations do not represent numbers in a way that lets machine learning algorithms use them effectively as features. Although we demonstrated that traditional information extraction can be effective in converting numbers into features, further community-wide research is required to systematically incorporate number representation into the word embedding process.
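
The general idea of turning numbers into features with pattern-matching rules and adding them to a vector space representation can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the three rules in NUMERIC_RULES, the feature names, the simple bag-of-words tokenizer, and the helper functions are hypothetical and are not the rules or the pipeline described in the article.

```python
import re
from collections import Counter

# Hypothetical pattern-matching rules (illustrative only; NOT the rules
# derived in the article). Each rule captures one number in group 1.
NUMERIC_RULES = {
    "age_years": re.compile(r"\b(\d{1,3})[- ]year[- ]old\b", re.IGNORECASE),
    "systolic_bp": re.compile(
        r"\b(?:bp|blood pressure)\D{0,15}?(\d{2,3})\s*/\s*\d{2,3}\b",
        re.IGNORECASE,
    ),
    "weight_kg": re.compile(
        r"\bweight\D{0,15}?(\d+(?:\.\d+)?)\s*kg\b", re.IGNORECASE
    ),
}


def extract_numeric_features(text):
    """Return {feature name: float or None} using the first match of each rule."""
    features = {}
    for name, pattern in NUMERIC_RULES.items():
        match = pattern.search(text)
        features[name] = float(match.group(1)) if match else None
    return features


def bag_of_words(text, vocabulary):
    """Count occurrences of each vocabulary word (alphabetic tokens only)."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [float(counts[word]) for word in vocabulary]


def document_vector(text, vocabulary):
    """Concatenate word counts with (value, presence flag) per numeric rule."""
    numeric = extract_numeric_features(text)
    vector = bag_of_words(text, vocabulary)
    for name in sorted(NUMERIC_RULES):  # fixed order keeps columns aligned
        value = numeric[name]
        vector.append(0.0 if value is None else value)
        vector.append(0.0 if value is None else 1.0)
    return vector


if __name__ == "__main__":
    note = "A 54-year-old man, weight 92.5 kg, BP 140/90."
    print(extract_numeric_features(note))
    print(document_vector(note, ["man", "weight", "kg"]))
```

The presence flag next to each value is a design choice of this sketch: it lets a classifier distinguish a missing measurement from a measured value of zero. In practice the numeric columns would likely need scaling before being passed to a support vector machine or neural network.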

Item Type: Article
Date Type: Publication
Status: Published
Schools: Computer Science & Informatics
Subjects: Q Science > QA Mathematics > QA76 Computer software
Publisher: MDPI
ISSN: 2504-4990
Date of First Compliant Deposit: 24 July 2023
Date of Acceptance: 5 July 2023
Last Modified: 24 Jul 2023 18:55
URI: https://orca.cardiff.ac.uk/id/eprint/160867
