eprintid: 10190972
rev_number: 6
eprint_status: archive
userid: 699
dir: disk0/10/19/09/72
datestamp: 2024-04-19 10:51:19
lastmod: 2024-04-19 10:51:19
status_changed: 2024-04-19 10:51:19
type: article
metadata_visibility: show
sword_depositor: 699
creators_name: Lin, W
creators_name: Wells, J
creators_name: Wang, Z
creators_name: Orengo, C
creators_name: Martin, ACR
title: Enhancing missense variant pathogenicity prediction with protein language models using VariPred
ispublished: pub
divisions: UCL
divisions: B02
divisions: C08
divisions: D09
divisions: G03
keywords: Virulence, Proteins, Mutation, Missense, Amino Acid Sequence, Computational Biology
note: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
abstract: Computational approaches for predicting the pathogenicity of genetic variants have advanced in recent years. These methods enable researchers to determine the possible clinical impact of rare and novel variants. Historically, these prediction methods used hand-crafted features based on structural, evolutionary, or physicochemical properties of the variant. In this study, we propose a novel framework that leverages the power of pre-trained protein language models to predict variant pathogenicity. We show that our approach, VariPred (Variant impact Predictor), outperforms current state-of-the-art methods by using an end-to-end model that only requires the protein sequence as input. Using one of the best-performing protein language models (ESM-1b), we establish a robust classifier that requires no calculation of structural features or multiple sequence alignments. We compare the performance of VariPred with other representative models including 3Cnet, PolyPhen-2, REVEL, MetaLR, FATHMM and ESM variant. VariPred performs as well as, or in most cases better than, these other predictors on six variant impact prediction benchmarks, despite requiring only sequence data and no pre-processing of the data.
date: 2024-12-01
date_type: published
publisher: Springer Science and Business Media LLC
official_url: http://dx.doi.org/10.1038/s41598-024-51489-7
oa_status: green
full_text_type: pub
language: eng
primo: open
primo_central: open_green
verified: verified_manual
elements_id: 2268559
doi: 10.1038/s41598-024-51489-7
medium: Electronic
pii: 10.1038/s41598-024-51489-7
lyricists_name: Orengo, Christine
lyricists_name: Martin, Andrew
lyricists_id: CAORE63
lyricists_id: ACRMA18
actors_name: Bracey, Alan
actors_id: ABBRA90
actors_role: owner
full_text_status: public
publication: Scientific Reports
volume: 14
number: 1
article_number: 8136
event_location: England
citation: Lin, W; Wells, J; Wang, Z; Orengo, C; Martin, ACR; (2024) Enhancing missense variant pathogenicity prediction with protein language models using VariPred. Scientific Reports, 14 (1), Article 8136. 10.1038/s41598-024-51489-7 <https://doi.org/10.1038/s41598-024-51489-7>. Green open access
document_url: https://discovery-pp.ucl.ac.uk/id/eprint/10190972/1/s41598-024-51489-7%20%281%29.pdf