Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

Analysis of missense variants in the human genome reveals widespread gene-specific clustering and improves prediction of pathogenicity

Quinodoz, Mathieu, Peter, Virginie G., Cisarova, Katarina, Royer-Bertrand, Beryl, Stenson, Peter D., Cooper, David N., Unger, Sheila, Superti-Furga, Andrea and Rivolta, Carlo 2022. Analysis of missense variants in the human genome reveals widespread gene-specific clustering and improves prediction of pathogenicity. American Journal of Human Genetics 109 (3), pp. 457-470. 10.1016/j.ajhg.2022.01.006

PDF - Published Version
Available under License Creative Commons Attribution Non-commercial No Derivatives.


We used a machine learning approach to analyze the within-gene distribution of missense variants observed in hereditary conditions and cancer. When applied to 840 genes from the ClinVar database, this approach detected a significant non-random distribution of pathogenic and benign variants in 387 (46%) and 172 (20%) genes, respectively, revealing that variant clustering is widespread across the human exome. This clustering likely occurs as a consequence of mechanisms shaping pathogenicity at the protein level, as illustrated by the overlap of some clusters with known functional domains. We then took advantage of these findings to develop a pathogenicity predictor, MutScore, that integrates qualitative features of DNA substitutions with the new additional information derived from this positional clustering. Using a random forest approach, MutScore was able to identify pathogenic missense mutations with very high accuracy, outperforming existing predictive tools, especially for variants associated with autosomal-dominant disease and cancer. Thus, the within-gene clustering of pathogenic and benign DNA changes is an important and previously underappreciated feature of the human exome, which can be harnessed to improve the prediction of pathogenicity and disambiguation of DNA variants of uncertain significance.
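
The idea of a "non-random distribution" of variants within a gene can be illustrated with a small sketch. This is not the authors' machine learning approach, which the abstract does not describe in detail; it swaps in a simple permutation test on the mean nearest-neighbor distance between variant positions. The function names, the example positions, and the protein length of 1000 residues are all hypothetical and chosen only for illustration.

```python
import random


def mean_nearest_neighbor_distance(positions):
    """Mean distance (in residues) from each variant to its closest other variant."""
    pts = sorted(positions)
    total = 0
    for i, p in enumerate(pts):
        left = p - pts[i - 1] if i > 0 else float("inf")
        right = pts[i + 1] - p if i < len(pts) - 1 else float("inf")
        total += min(left, right)
    return total / len(pts)


def clustering_p_value(positions, protein_length, n_permutations=2000, seed=0):
    """Permutation p-value for clustering (assumes at least two variants).

    Null hypothesis: variants fall uniformly at random along the protein.
    A small p-value means the observed variants sit closer together than
    expected under that null.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = mean_nearest_neighbor_distance(positions)
    hits = 0
    for _ in range(n_permutations):
        simulated = [rng.randint(1, protein_length) for _ in positions]
        if mean_nearest_neighbor_distance(simulated) <= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical residue positions in a hypothetical 1000-residue protein.
clustered = [101, 103, 104, 107, 110, 112, 115, 118, 120, 121]
spread = [50, 150, 250, 350, 450, 550, 650, 750, 850, 950]

print("clustered:", clustering_p_value(clustered, 1000))
print("spread:", clustering_p_value(spread, 1000))
```

Under these assumptions, the tightly grouped positions should yield a small p-value and the evenly spaced ones a large p-value. A predictor such as MutScore combines positional information of this kind with other variant features in a random forest, which this sketch does not attempt to reproduce.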

Item Type: Article
Date Type: Publication
Status: Published
Schools: Medicine
Additional Information: This is an open access article under the CC BY-NC-ND license.
Publisher: Cell Press
ISSN: 0002-9297
Date of First Compliant Deposit: 16 February 2022
Date of Acceptance: 11 January 2022
Last Modified: 04 May 2023 18:02

Citation Data

Cited 5 times in Scopus.
