Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

Sequential data selection for predicting the pathogenic effects of sequence variation

Rogers, Mark F., Campbell, Colin, Shihab, Hashem A., Gaunt, Tom R., Mort, Matthew and Cooper, David Neil ORCID: https://orcid.org/0000-0002-8943-8484 2015. Sequential data selection for predicting the pathogenic effects of sequence variation. Presented at: 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Washington DC, USA, 9-12 November 2015. Bioinformatics and Biomedicine (BIBM), 2015 IEEE International Conference on. IEEE, pp. 639-644. 10.1109/BIBM.2015.7359759

Full text not available from this repository.

Abstract

Recent improvements in sequencing technologies provide unprecedented opportunities to investigate the role of genetic variation in human disease. In previous work we proposed a machine learning approach to predicting whether single nucleotide variants (SNVs) are functional or neutral in human disease. Many data sources from the Encyclopaedia of DNA Elements (ENCODE) may be relevant to this problem. To integrate these data sources, we applied integrative multiple kernel learning (MKL), which weights each source according to its relevance. Using an MKL optimization that yields sparse weights, we were able to eliminate the least informative data sources from our model. However, when selecting from a wide assortment of data sources, we have found that MKL may not be an efficient method for eliminating uninformative sources. Many data sources related to the human genome are incomplete: this can dramatically reduce both the data available for training and the proportion of novel predictions that can exploit all relevant sources. Here we introduce a greedy sequential selection method that assesses data sources in a structured fashion prior to MKL weight optimization. This method allows us to eliminate a majority of uninformative data sources before assigning kernel weights. When we use this method with our coding-region predictor, we select just five kernels for our final model, yielding increased accuracy over our previous model. In addition, by reducing the amount of data required for novel predictions, we are able to increase our model's coverage for new predictions fivefold.
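
The abstract does not spell out the selection procedure, so the following is only a minimal sketch of a generic greedy forward selection, not the authors' implementation. It assumes a caller-supplied `score` function (for example, cross-validated accuracy of a model trained on the chosen sources) and uses hypothetical source names, toy scores, and toy availability sets. The `coverage` helper illustrates the point about incomplete sources: a prediction that needs every selected source can only be made where all of them have data.

```python
from typing import Callable, Dict, List, Sequence, Set


def greedy_select(
    sources: Sequence[str],
    score: Callable[[Sequence[str]], float],
    min_gain: float = 0.0,
) -> List[str]:
    """Greedy forward selection: repeatedly add the source that raises
    `score` the most, and stop once the best gain is <= `min_gain`."""
    selected: List[str] = []
    remaining = list(sources)
    best = float("-inf")
    while remaining:
        # Score every remaining candidate when added to the current selection.
        trials = [(score(selected + [s]), s) for s in remaining]
        top_score, top_source = max(trials)
        if selected and top_score - best <= min_gain:
            break  # no remaining source improves the score enough
        selected.append(top_source)
        remaining.remove(top_source)
        best = top_score
    return selected


def coverage(
    selected: Sequence[str],
    available: Dict[str, Set[int]],
    n_examples: int,
) -> float:
    """Fraction of examples for which every selected source has data."""
    if not selected:
        return 1.0
    usable = set.intersection(*(set(available[s]) for s in selected))
    return len(usable) / n_examples


# Hypothetical toy data (assumptions for illustration, not from the paper).
# Gains are in percentage points of accuracy over a 50% baseline.
TOY_GAIN = {"conservation": 30, "histone": 10, "tfbs": 5,
            "noise_a": 0, "noise_b": -2}
# Example indices (out of 10) for which each source has data.
TOY_AVAILABLE = {
    "conservation": set(range(10)),
    "histone": set(range(8)),
    "tfbs": set(range(6)),
    "noise_a": set(range(2)),
    "noise_b": set(range(2)),
}


def toy_score(chosen: Sequence[str]) -> float:
    """Stand-in for a real evaluation such as cross-validated accuracy."""
    return 50 + sum(TOY_GAIN[s] for s in chosen)


if __name__ == "__main__":
    picked = greedy_select(list(TOY_GAIN), toy_score)
    print("selected:", picked)
    print("coverage with selected sources:",
          coverage(picked, TOY_AVAILABLE, 10))
    print("coverage with all sources:",
          coverage(list(TOY_GAIN), TOY_AVAILABLE, 10))
```

In this toy setting, dropping the two uninformative sources before any weighting step leaves more examples with complete data than using every source would. In a real pipeline the kernel weights for the surviving sources would then be fitted by MKL, which this sketch does not attempt.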

Item Type: Conference or Workshop Item (Paper)
Status: Published
Schools: Medicine
Subjects: Q Science > QH Natural history > QH426 Genetics
Publisher: IEEE
Last Modified: 01 Nov 2022 10:16
URI: https://orca.cardiff.ac.uk/id/eprint/90892

Citation Data

Cited 4 times in Scopus.
