Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

Gene filtering strategies for machine learning guided biomarker discovery using neonatal sepsis RNA-seq data

Parkinson, Edward, Liberatore, Federico, Watkins, W. John, Andrews, Robert, Edkins, Sarah, Hibbert, Julie, Strunk, Tobias, Currie, Andrew and Ghazal, Peter 2023. Gene filtering strategies for machine learning guided biomarker discovery using neonatal sepsis RNA-seq data. Frontiers in Genetics 14, 1158352. 10.3389/fgene.2023.1158352

PDF - Published Version
Available under License Creative Commons Attribution.


Machine learning (ML) algorithms are powerful tools that are increasingly being used for sepsis biomarker discovery in RNA-Seq data. RNA-Seq datasets contain multiple sources and types of noise (operator, technical and non-systematic) that may bias ML classification. Normalisation and independent gene filtering approaches described in RNA-Seq workflows account for some of this variability, but are typically targeted only at differential expression analysis rather than ML applications. Pre-processing normalisation steps significantly reduce the number of variables in the data and thereby increase the power of statistical testing, but can potentially discard valuable and insightful classification features. The effect of transcript-level filtering on the robustness and stability of ML-based RNA-Seq classification remains to be fully explored. In this report we examine the impact of filtering out low-count transcripts and those with influential outlier read counts on downstream ML analysis for sepsis biomarker discovery using elastic net regularised logistic regression, L1-regularised support vector machines and random forests. We demonstrate that applying a systematic, objective strategy for removal of uninformative and potentially biasing biomarkers, representing up to 60% of transcripts in datasets of different sample sizes, including two illustrative neonatal sepsis cohorts, leads to substantial improvements in classification performance, higher stability of the resulting gene signatures, and better agreement with previously reported sepsis biomarkers. We also demonstrate that the performance uplift from gene filtering depends on the ML classifier chosen, with L1-regularised support vector machines showing the greatest performance improvements with our experimental data.
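
To make the idea of low-count transcript filtering concrete, the following is a minimal sketch of one common convention: keep a gene only if its counts-per-million (CPM) reaches a threshold in a minimum number of samples. This is not taken from the article and may differ from the authors' actual procedure; the function name `filter_low_count_genes`, the thresholds `min_cpm` and `min_samples`, and the toy count matrix are all assumptions made for illustration.

```python
import numpy as np


def filter_low_count_genes(counts, min_cpm=1.0, min_samples=2):
    """Return a boolean mask of genes (rows) to keep.

    counts: 2-D array of raw read counts, shape (n_genes, n_samples).
    A gene is kept when its counts-per-million (CPM) is at least
    `min_cpm` in at least `min_samples` samples. Both thresholds are
    illustrative assumptions, not values from the article.
    """
    counts = np.asarray(counts, dtype=float)
    library_sizes = counts.sum(axis=0)      # total reads per sample (column)
    cpm = counts * 1e6 / library_sizes      # rescale each column to CPM
    return (cpm >= min_cpm).sum(axis=1) >= min_samples


# Toy matrix: 4 genes x 3 samples (made-up numbers, 1,000,000 reads per sample).
counts = np.array([
    [500_000, 400_000, 600_000],
    [499_995, 599_992, 399_999],
    [5, 8, 0],   # low, but expressed in two samples
    [0, 0, 1],   # detected in a single sample only
])

keep = filter_low_count_genes(counts, min_cpm=1.0, min_samples=2)
filtered = counts[keep]  # matrix that would be passed on to a classifier
```

In this toy case the fourth gene is dropped because it reaches the CPM threshold in only one sample. A filter of this kind would normally be fitted on training samples only, so that the test data do not influence which features are retained.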

Item Type: Article
Date Type: Published Online
Status: Published
Schools: Medicine; Computer Science & Informatics
Publisher: Frontiers Media
ISSN: 1664-8021
Date of First Compliant Deposit: 13 April 2023
Date of Acceptance: 29 March 2023
Last Modified: 30 Jun 2024 01:20

Citation Data

Cited 2 times in Scopus.
