Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

Machine learning for the genetic prediction of Alzheimer’s Disease

Rowe, Thomas 2023. Machine learning for the genetic prediction of Alzheimer’s Disease. PhD Thesis, Cardiff University.
Item availability restricted.

PDF - Accepted Post-Print Version

Abstract

Alzheimer’s disease (AD) is the most common form of dementia in humans, with a disease course involving initial memory loss, a subsequent debilitative state and eventually death. It is a polygenic disorder, meaning its genetic component comprises many known and unknown mutations. This complexity, alongside further influences from a range of lifestyle factors, has made the prediction of disease risk a challenging pursuit. The initial attempts to predict AD risk from genetic data arose from the identification of risk loci in genome-wide association studies (GWAS). The resulting variants are used to assess the risk of disease onset through polygenic risk scoring (PRS). This score is generated by summing risk alleles multiplied by their respective effect sizes derived from GWAS. Published results demonstrate PRS to be a useful method for assessing lifetime risk; however, it has also been shown that PRS can only cover a fraction of the genetic liability for AD. A possible explanation for this inadequacy is the inability of PRS to assess non-linear relationships between loci, owing to its use of linear modelling. Given that AD is a complex polygenic disorder, it is likely that onset is the result of interactions between loci. Machine learning (ML) is an approach capable of analysing non-linear patterns. Interest in these algorithms has increased in recent decades due to their predictive power, ability to analyse large datasets, and capabilities in disease prediction. In this thesis, initial results demonstrated superior performance for PRS compared to ML when using datasets comprising small numbers of AD-associated single nucleotide polymorphisms (SNPs). However, in some instances ML achieved accuracies close to that of PRS. This occurred when using support vector machines with various kernels. However, it was acknowledged that these algorithms would result in excessive training times when using the larger datasets in subsequent chapters.
Therefore, only decision tree-based algorithms were employed moving forwards. It was also deduced that techniques such as balancing by age and sex made no discernible difference to model performance. Further investigation involved the use of variants sourced on a genome-wide scale, as it was reasoned that using a greater number of SNPs might improve upon results from the previous chapter. However, increasing the number of variants resulted in issues relating to high dimensionality. Despite efforts to alleviate this through the use of feature selection techniques, prediction performance for ML models was still inferior to PRS. Further avenues were also explored, such as using a more lenient r² threshold when clumping and removing this step completely for SNP selection, but these again failed to improve ML prediction accuracy. PRS continued to achieve better performance when using an imputed version of the dataset used in previous analyses. This was still evident when again exploring methods such as feature selection. However, the observed difference between ML and PRS was reduced in the final investigations conducted in this thesis. Analysis of datasets comprising SNPs derived from pathways biologically associated with AD resulted in improved ML performance. This result identified the possibility of focusing on the underpinning biological mechanisms of AD when selecting datasets.
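
The polygenic risk score described in the abstract, a sum of risk-allele counts weighted by GWAS effect sizes, can be illustrated with a minimal sketch. This is not the thesis's implementation: the function name, the three-SNP genotype and the effect sizes below are hypothetical values chosen only to show the arithmetic.

```python
def polygenic_risk_score(allele_counts, effect_sizes):
    """Return the sum of risk-allele counts weighted by per-SNP effect sizes.

    allele_counts: risk-allele count per SNP (0, 1 or 2) for one individual.
    effect_sizes:  per-allele effect size per SNP, in the same SNP order.
    """
    if len(allele_counts) != len(effect_sizes):
        raise ValueError("one effect size is needed per SNP")
    return sum(count * beta for count, beta in zip(allele_counts, effect_sizes))


# Hypothetical individual genotyped at three SNPs (illustrative values only).
counts = [2, 0, 1]             # risk-allele counts per SNP
betas = [0.5, 0.25, -0.125]    # made-up effect sizes, not taken from any GWAS
score = polygenic_risk_score(counts, betas)
print(score)
```

Because each SNP contributes independently to the total, a score of this form is purely additive, which is the linearity the abstract contrasts with ML models that can capture interactions between loci.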

Item Type: Thesis (PhD)
Date Type: Completion
Status: Unpublished
Schools: Medicine
Date of First Compliant Deposit: 3 May 2024
Last Modified: 03 May 2024 15:47
URI: https://orca.cardiff.ac.uk/id/eprint/168711
