Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

Machine learning for prediction of schizophrenia using genetic and demographic factors in the UK Biobank

Bracher-Smith, Matthew, Rees, Elliott, Menzies, Georgina, Walters, James, O'Donovan, Michael, Owen, Michael J., Kirov, George and Escott-Price, Valentina 2022. Machine learning for prediction of schizophrenia using genetic and demographic factors in the UK Biobank. Schizophrenia Research 246, pp. 156-164. doi:10.1016/j.schres.2022.06.006

PDF - Published Version
Available under License Creative Commons Attribution.

Machine learning (ML) holds promise for precision psychiatry, but its predictive performance is unclear. We assessed whether ML provided added value over logistic regression for prediction of schizophrenia, and compared models built using polygenic risk scores (PRS) or clinical/demographic factors. LASSO and ridge-penalised logistic regression, support vector machines (SVM), random forests, boosting, neural networks and stacked models were trained to predict schizophrenia, using PRS for schizophrenia (PRSSZ), sex, parental depression, educational attainment, winter birth, handedness and number of siblings as predictors. Models were evaluated for discrimination using area under the receiver operator characteristic curve (AUROC) and relative importance of predictors using permutation feature importance (PFI). In a secondary analysis, fitted models were tested for association with schizophrenia-related traits which had not been used in model development. Following learning curve analysis, 738 cases and 3690 randomly sampled controls were selected from the UK Biobank. ML models combining all predictors showed the highest discrimination (linear SVM, AUROC = 0.71), but did not significantly outperform logistic regression. AUROC was robust over 100 random resamples of controls. PFI identified PRSSZ as the most important predictor. Highest variance in fitted models was explained by schizophrenia-related traits including fluid intelligence (most associated: linear SVM), digit symbol substitution (RBF SVM), BMI (XGBoost), smoking status (XGBoost) and deprivation (linear SVM). In conclusion, ML approaches did not provide substantial added value for prediction of schizophrenia over logistic regression, as indexed by AUROC; however, risk scores derived with different ML approaches differ with respect to association with schizophrenia-related traits.
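
The two evaluation measures named in the abstract, AUROC and permutation feature importance, can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the data are synthetic, the two predictors (`prs_like` and `noise`), the fixed linear `score` function with its weights, and the sample sizes are all hypothetical choices made for illustration. AUROC is computed through its rank-sum (Mann-Whitney) form, and the importance of a predictor is taken as the mean drop in AUROC after shuffling that predictor's column.

```python
import random


def auroc(scores, labels):
    """AUROC via the rank-sum identity; tied scores share an average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def permutation_importance(score_fn, rows, labels, n_features,
                           n_repeats=10, seed=0):
    """Mean drop in AUROC when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = auroc([score_fn(r) for r in rows], labels)
    drops = []
    for f in range(n_features):
        total = 0.0
        for _ in range(n_repeats):
            column = [r[f] for r in rows]
            rng.shuffle(column)
            permuted = [r[:f] + [v] + r[f + 1:] for r, v in zip(rows, column)]
            total += baseline - auroc([score_fn(r) for r in permuted], labels)
        drops.append(total / n_repeats)
    return baseline, drops


# Synthetic data (assumption): cases have a shifted "prs_like" value,
# while "noise" is unrelated to case status.
data_rng = random.Random(42)
rows, labels = [], []
for is_case, n in ((1, 200), (0, 1000)):
    for _ in range(n):
        prs_like = data_rng.gauss(0.8 if is_case else 0.0, 1.0)
        noise = data_rng.gauss(0.0, 1.0)
        rows.append([prs_like, noise])
        labels.append(is_case)


def score(row):
    """Hypothetical fixed linear risk score standing in for a fitted model."""
    return 1.0 * row[0] + 0.1 * row[1]


baseline, drops = permutation_importance(score, rows, labels, n_features=2)
print("baseline AUROC:", round(baseline, 3))
print("mean AUROC drop per feature [prs_like, noise]:",
      [round(d, 3) for d in drops])
```

In this toy setting, shuffling the informative predictor lowers AUROC far more than shuffling the noise predictor, which is the sense in which permutation feature importance ranks predictors. A fitted model would take the place of the fixed `score` function in practice.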

Item Type: Article
Date Type: Publication
Status: Published
Schools: Medicine; MRC Centre for Neuropsychiatric Genetics and Genomics (CNGG)
Additional Information: This is an open access article under the CC BY license (
Publisher: Elsevier
ISSN: 0920-9964
Funders: MRC
Date of First Compliant Deposit: 29 June 2022
Date of Acceptance: 11 June 2022
Last Modified: 04 Sep 2023 16:26
