OMAmer: tree-driven and alignment-free protein assignment to subfamilies outperforms closest sequence approaches.

Document record

Date

29 September 2021

Relations

This document is related to:
info:eu-repo/semantics/altIdentifier/doi/10.1093/bioinformatics/btab219
info:eu-repo/semantics/altIdentifier/pmid/33787851
info:eu-repo/semantics/altIdentifier/eissn/1367-4811
info:eu-repo/semantics/altIdentifier/urn/urn:nbn:ch:serval-BIB_E9D32F404F199

Licences

info:eu-repo/semantics/openAccess , CC BY 4.0 , https://creativecommons.org/licenses/by/4.0/




Cite this document

V. Rossier et al., "OMAmer: tree-driven and alignment-free protein assignment to subfamilies outperforms closest sequence approaches", Serveur académique Lausannois, ID: 10.1093/bioinformatics/btab219



Abstract

Assigning new sequences to known protein families and subfamilies is a prerequisite for many functional, comparative and evolutionary genomics analyses. Such assignment is commonly achieved by looking for the closest sequence in a reference database, using a method such as BLAST. However, ignoring the gene phylogeny can be misleading because a query sequence does not necessarily belong to the same subfamily as its closest sequence. For example, a hemoglobin that branched off before the hemoglobin alpha/beta duplication could be closest to a hemoglobin alpha or beta sequence, although it is neither. To overcome this problem, phylogeny-driven tools have emerged, but they rely on gene trees, whose inference is computationally expensive. Here, we first show that in multiple animal and plant datasets, 18-62% of closest-sequence assignments are incorrect, typically placing the query in an over-specific subfamily. Then, we introduce OMAmer, a novel alignment-free protein subfamily assignment method, which limits over-specific subfamily assignments and is suited to phylogenomic databases with thousands of genomes. OMAmer uses evolutionarily informed k-mers for alignment-free mapping to ancestral protein subfamilies. We show that OMAmer, while able to reject non-homologous family-level assignments, provides better and quicker subfamily-level assignments than approaches relying on the closest sequence, whether inferred exactly by Smith-Waterman or by the fast heuristic DIAMOND. OMAmer is available from the Python Package Index (as omamer), with the source code and a precomputed database available at https://github.com/DessimozLab/omamer. Supplementary data are available at Bioinformatics online.
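
To make the idea of mapping k-mers to ancestral subfamilies concrete, here is a minimal toy sketch. It is not OMAmer's implementation or API: the subfamily tree, the reference sequences, the k-mer length, the `min_hits` threshold and every function name are hypothetical, chosen only for illustration. Each k-mer is stored at the lowest common ancestor of the subfamilies whose sequences contain it, and a query is placed by descending the tree only while a child subfamily has enough k-mer support.

```python
from collections import defaultdict

K = 3  # toy k-mer length (assumption; chosen so the example stays tiny)

# Hypothetical subfamily tree, child -> parent (the root has parent None).
PARENT = {"globin": None, "alpha": "globin", "beta": "globin"}

# Made-up reference sequences labelled with their most specific subfamily.
REFERENCES = [
    ("alpha", "MVLSPADWGKV"),
    ("alpha", "MVLSGEDWGKV"),
    ("beta", "MVHLTPEWGKV"),
    ("beta", "MVHLTAEWGKV"),
]

def kmers(seq, k=K):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def lineage(node):
    """Path from a node up to the root, starting with the node itself."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def lowest_common_ancestor(nodes):
    """Most specific subfamily that contains every node in `nodes`."""
    nodes = list(nodes)
    common = set(lineage(nodes[0]))
    for n in nodes[1:]:
        common &= set(lineage(n))
    # lineage() is ordered from specific to general, so the first hit is deepest.
    return next(n for n in lineage(nodes[0]) if n in common)

def build_index(references):
    """Map each k-mer to the LCA of the subfamilies whose sequences contain it."""
    seen = defaultdict(set)
    for label, seq in references:
        for km in kmers(seq):
            seen[km].add(label)
    return {km: lowest_common_ancestor(labels) for km, labels in seen.items()}

def subtree_hits(node, counts):
    """Number of query k-mers stored at `node` or at any of its descendants."""
    return sum(n for label, n in counts.items() if node in lineage(label))

def assign(query, index, min_hits=2):
    """Return the deepest supported subfamily, or None if the family is rejected."""
    counts = defaultdict(int)
    for km in kmers(query):
        if km in index:
            counts[index[km]] += 1
    if sum(counts.values()) < min_hits:
        return None  # too little evidence of homology to the family at all
    node = next(n for n, p in PARENT.items() if p is None)  # start at the root
    while True:
        children = [c for c, p in PARENT.items() if p == node]
        best = max(children, key=lambda c: subtree_hits(c, counts), default=None)
        if best is None or subtree_hits(best, counts) < min_hits:
            return node  # stop here rather than guess an over-specific subfamily
        node = best

INDEX = build_index(REFERENCES)
print(assign("MVLSAADWGKV", INDEX))  # alpha: shares alpha-specific k-mers
print(assign("MAQLTKQWGKV", INDEX))  # globin: only family-wide k-mers match
print(assign("PPPPPPPPPP", INDEX))   # None: no shared k-mers
```

The second query mirrors the hemoglobin example in the abstract: it shares only the k-mers common to both subfamilies, so the sketch leaves it at the ancestral level instead of forcing it into alpha or beta.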
