Buchan, DW; Jones, DT; (2020) Learning a Functional Grammar of Protein Domains using Natural Language Word Embedding Techniques. Proteins, 88 (4) pp. 616-624. 10.1002/prot.25842.
Abstract
In this paper, using word2vec, a widely used natural language processing method, we demonstrate that protein domains may have a learnable implicit semantic "meaning" in the context of their functional contributions to the multi-domain proteins in which they are found. Word2vec is a group of models which can be used to produce semantically meaningful embeddings of words or tokens in a fixed-dimension vector space. In this work, we treat multi-domain proteins as "sentences" in which domain identifiers are tokens that may be considered "words". Using all InterPro [1] Pfam domain assignments, we observe that the embedding could be used to suggest putative GO assignments for Pfam [2] Domains of Unknown Function.
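The "proteins as sentences" framing in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the protein names, the domain architectures, the `skipgram_pairs` helper and the window size are all hypothetical, and the accession strings merely follow Pfam's `PFnnnnn` format. The sketch only shows how domain identifiers, read in order along a protein, yield the (target, context) pairs that a skip-gram word2vec model would be trained on.

```python
from collections import Counter


def skipgram_pairs(sentences, window=2):
    """Yield (target, context) token pairs within `window` positions."""
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    yield target, tokens[j]


# Invented toy architectures (assumption): each protein is an ordered
# list of domain identifiers, i.e. one "sentence" of domain "words".
proteins = {
    "protein_A": ["PF00017", "PF00018", "PF00069"],
    "protein_B": ["PF00018", "PF00069"],
    "protein_C": ["PF00017", "PF00069", "PF00069"],
}

sentences = list(proteins.values())

# A window of 1 pairs each domain with its immediate neighbours only.
pairs = list(skipgram_pairs(sentences, window=1))
pair_counts = Counter(pairs)

for (target, context), count in sorted(pair_counts.items()):
    print(target, context, count)
```

In practice, sentences of this shape would be handed to a word2vec implementation, which learns one fixed-dimension vector per domain identifier from such co-occurrences.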
Type: | Article |
---|---|
Title: | Learning a Functional Grammar of Protein Domains using Natural Language Word Embedding Techniques |
Location: | United States |
Open access status: | An open access version is available from UCL Discovery |
DOI: | 10.1002/prot.25842 |
Publisher version: | https://doi.org/10.1002/prot.25842 |
Language: | English |
Additional information: | This version is the author accepted manuscript. For information on re-use, please refer to the publisher’s terms and conditions. |
Keywords: | Semantic embedding, function prediction, machine learning, protein domains, word2vec |
UCL classification: | UCL; UCL > Provost and Vice Provost Offices > UCL BEAMS; UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science; UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science > Dept of Computer Science |
URI: | https://discovery-pp.ucl.ac.uk/id/eprint/10086769 |