Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

Distributed learning on 20 000+ lung cancer patients - The Personal Health Train

Deist, Timo M., Dankers, Frank J. W. M., Ojha, Priyanka, Marshall, M. Scott, Janssen, Tomas, Faivre-Finn, Corinne, Masciocchi, Carlotta, Valentini, Vincenzo, Wang, Jiazhou, Chen, Jiayan, Zhang, Zhen, Spezi, Emiliano, Button, Mick, Nuyttens, Joost Jan, Vernhout, René, van Soest, Johan, Jochems, Arthur, Monshouwer, René, Bussink, Johan, Price, Gareth, Lambin, Philippe and Dekker, Andre 2020. Distributed learning on 20 000+ lung cancer patients - The Personal Health Train. Radiotherapy and Oncology 144 , pp. 189-200. 10.1016/j.radonc.2019.11.019

PDF - Published Version
Available under License Creative Commons Attribution Non-commercial No Derivatives.


Background and purpose: Access to healthcare data is indispensable for scientific progress and innovation. Sharing healthcare data is time-consuming and notoriously difficult due to privacy and regulatory concerns. The Personal Health Train (PHT) provides a privacy-by-design infrastructure connecting FAIR (Findable, Accessible, Interoperable, Reusable) data sources and allows distributed data analysis and machine learning. Patient data never leaves a healthcare institute.

Materials and methods: Lung cancer patient-specific databases (tumor staging and post-treatment survival information) of oncology departments were translated according to a FAIR data model and stored locally in a graph database. Software was installed locally to enable deployment of distributed machine learning algorithms via a central server. Algorithms (MATLAB, code and documentation publicly available) are patient privacy-preserving as only summary statistics and regression coefficients are exchanged with the central server. A logistic regression model to predict post-treatment two-year survival was trained and evaluated by receiver operating characteristic curves (ROC), root mean square prediction error (RMSE) and calibration plots.

Results: In 4 months, we connected databases with 23 203 patient cases across 8 healthcare institutes in 5 countries (Amsterdam, Cardiff, Maastricht, Manchester, Nijmegen, Rome, Rotterdam, Shanghai) using the PHT. Summary statistics were computed across databases. A distributed logistic regression model predicting post-treatment two-year survival was trained on 14 810 patients treated between 1978 and 2011 and validated on 8 393 patients treated between 2012 and 2015.

Conclusion: The PHT infrastructure demonstrably overcomes patient privacy barriers to healthcare data sharing and enables fast data analyses across multiple institutes from different countries with different regulatory regimens. This infrastructure promotes global evidence-based medicine while prioritizing patient privacy.
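The idea that only summary statistics and regression coefficients leave each institute can be illustrated with a small sketch. This is not the authors' implementation (their algorithms are in MATLAB, and the abstract does not state which optimisation scheme they use). The Python below assumes plain gradient aggregation as one simple way to fit a logistic regression without pooling records. All names (`local_gradient`, `train_distributed`, `make_site`) and the synthetic data are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def local_gradient(weights, rows, labels):
    """Runs inside one institute. Only the summed gradient and the
    record count are returned; individual records are never returned."""
    grad = [0.0] * len(weights)
    for x, y in zip(rows, labels):
        p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
        for j, xi in enumerate(x):
            grad[j] += (p - y) * xi
    return grad, len(rows)


def train_distributed(sites, n_features, lr=0.5, rounds=200):
    """Central server loop (assumed scheme): broadcast coefficients,
    collect per-site gradient sums, take one averaged gradient step."""
    weights = [0.0] * n_features
    for _ in range(rounds):
        total = [0.0] * n_features
        n = 0
        for rows, labels in sites:
            grad, count = local_gradient(weights, rows, labels)
            total = [t + g for t, g in zip(total, grad)]
            n += count
        weights = [w - lr * t / n for w, t in zip(weights, total)]
    return weights


def make_site(rng, n, true_w):
    # Synthetic stand-in for one institute's data: intercept + one feature.
    rows, labels = [], []
    for _ in range(n):
        x = [1.0, rng.gauss(0.0, 1.0)]
        p = sigmoid(sum(w * xi for w, xi in zip(true_w, x)))
        rows.append(x)
        labels.append(1 if rng.random() < p else 0)
    return rows, labels


rng = random.Random(0)
true_w = [-0.5, 1.5]
sites = [make_site(rng, n, true_w) for n in (300, 500, 200)]

# Distributed fit: the server only ever sees gradient sums and counts.
w_dist = train_distributed(sites, n_features=2)

# Reference fit on pooled data, possible here only because the data are synthetic.
pooled = (
    [x for rows, _ in sites for x in rows],
    [y for _, labels in sites for y in labels],
)
w_pooled = train_distributed([pooled], n_features=2)
```

Because the gradient of the logistic loss is a sum over records, adding the per-site sums reproduces the pooled gradient, so `w_dist` and `w_pooled` agree up to floating-point rounding. The sketch covers the arithmetic only and leaves out the infrastructure the abstract describes (FAIR data model, graph database, secure deployment).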

Item Type: Article
Date Type: Publication
Status: Published
Schools: Engineering
Additional Information: This is an open access article under the terms of the CC-BY Attribution 4.0 International license.
Publisher: Elsevier
ISSN: 0167-8140
Date of First Compliant Deposit: 5 December 2019
Date of Acceptance: 19 November 2019
Last Modified: 04 Dec 2020 11:45

Citation Data

Cited 37 times in Scopus.
