Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 
WelshClear Cookie - decide language by browser settings

Which species is it? Species-driven gene name disambiguation using random walks over a mixture of adjacency matrices.

Harmston, Nathan, Filsell, Wendy and Stumpf, Michael P.H. 2011. Which species is it? Species-driven gene name disambiguation using random walks over a mixture of adjacency matrices. Bioinformatics 28 (2) , pp. 254-260. 10.1093/bioinformatics/btr640

Full text not available from this repository.

Abstract

The scientific literature contains a wealth of information about biological systems. Manual curation lacks the scalability to extract this information due to the ever-increasing numbers of papers being published. The development and application of text mining technologies has been proposed as a way of dealing with this problem. However, the inter-species ambiguity of the genomic nomenclature makes mapping of gene mentions identified in text to their corresponding Entrez gene identifiers an extremely difficult task. We propose a novel method, which transforms a MEDLINE record into a mixture of adjacency matrices; by performing a random walkover the resulting graph, we can perform multi-class supervised classification allowing the assignment of taxonomy identifiers to individual gene mentions. The ability to achieve good performance at this task has a direct impact on the performance of normalizing gene mentions to Entrez gene identifiers. Such graph mixtures add flexibility and allow us to generate probabilistic classification schemes that naturally reflect the uncertainties inherent, even in literature-derived data.Our method performs well in terms of both micro- and macro-averaged performance, achieving micro-F(1) of 0.76 and macro-F(1) of 0.36 on the publicly available DECA corpus. Re-curation of the DECA corpus was performed, with our method achieving 0.88 micro-F(1) and 0.51 macro-F(1). Our method improves over standard classification techniques [such as support vector machines (SVMs)] in a number of ways: flexibility, interpretability and its resistance to the effects of class bias in the training data. Good performance is achieved without the need for computationally expensive parse tree generation or 'bag of words classification'.

Item Type: Article
Date Type: Publication
Status: Published
Schools: Biosciences
Publisher: Oxford University Press
ISSN: 1367-4803
Date of Acceptance: 15 November 2011
Last Modified: 25 Aug 2023 11:45
URI: https://orca.cardiff.ac.uk/id/eprint/161877

Actions (repository staff only)

Edit Item Edit Item