Zeller Group, EMBL Heidelberg, Service center: Heidelberg Center for Human Bioinformatics – HD-HuB

Advances in sequencing technology have enabled the analysis of microbial communities in environments ranging from the human gut to the oceans. Many of the sequences in the ever-growing space of (meta)genomic data stem from novel organisms that are not represented in reference databases. Universal phylogenetic marker genes have proven instrumental for the taxonomic analysis of these uncharacterized microbes. However, the central analytical task of correctly assigning a taxonomic lineage to prokaryotic amplicon sequences, genes, and genomes remains challenging: existing methods lack resolution, make many false assignments, or are computationally demanding. To address these issues, we developed STAG (short for Supervised Taxonomic Analysis framework for marker Genes), a machine-learning based software and a new de.NBI tool candidate. It employs a hierarchy of machine-learning classifiers, which extract informative features from a multiple sequence alignment of marker genes to distinguish sister clades at each branch in the taxonomic tree. STAG not only improves the resolution and accuracy of taxonomic assignments, but also uses computational resources (CPU time and memory) more efficiently than competing tools. Because this architecture translates into superior accuracy in the taxonomic classification of genes and genomes, STAG has the potential to enable biological discoveries from the rapidly growing number of prokaryotic genomes produced by high-throughput isolate sequencing and metagenomic assembly.
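
The idea of one classifier per branch of the taxonomic tree can be illustrated with a small sketch. This is not STAG's implementation: the nearest-centroid classifier stands in for whatever learners STAG actually uses, the one-hot encoding of alignment columns is an assumed feature representation, and all names (`one_hot`, `CentroidNode`, `train_hierarchy`, `classify`) as well as the toy sequences are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

ALPHABET = "ACGT-"  # assumed symbols in the aligned marker-gene sequences


def one_hot(aligned_seq):
    """Encode an aligned sequence as a flat 0/1 vector, one slot per (column, symbol)."""
    vec = []
    for ch in aligned_seq:
        vec.extend(1.0 if ch == sym else 0.0 for sym in ALPHABET)
    return vec


class CentroidNode:
    """Stand-in classifier for one branch point: picks the child clade whose
    mean alignment profile is closest to the query."""

    def __init__(self):
        self.centroids = {}

    def fit(self, vectors, labels):
        groups = defaultdict(list)
        for vec, label in zip(vectors, labels):
            groups[label].append(vec)
        for label, members in groups.items():
            self.centroids[label] = [sum(col) / len(members) for col in zip(*members)]

    def predict(self, vec):
        def sq_dist(centroid):
            return sum((a - b) ** 2 for a, b in zip(vec, centroid))

        return min(self.centroids, key=lambda label: sq_dist(self.centroids[label]))


def train_hierarchy(seqs, lineages):
    """Train one node classifier per lineage prefix (root, each domain, ...)."""
    by_prefix = defaultdict(lambda: ([], []))
    for seq, lineage in zip(seqs, lineages):
        vec = one_hot(seq)
        for depth in range(len(lineage)):
            vectors, labels = by_prefix[tuple(lineage[:depth])]
            vectors.append(vec)
            labels.append(lineage[depth])
    nodes = {}
    for prefix, (vectors, labels) in by_prefix.items():
        node = CentroidNode()
        node.fit(vectors, labels)
        nodes[prefix] = node
    return nodes


def classify(nodes, seq):
    """Walk down the tree, letting each node choose among its sister clades."""
    vec = one_hot(seq)
    path = ()
    while path in nodes:
        path = path + (nodes[path].predict(vec),)
    return list(path)


# Toy training set: made-up aligned sequences with two-level lineages.
train_seqs = ["ACGTACGT", "ACGTACGA", "ACGTTCGT", "ACGTTCGA", "TGCA-CGT", "TGCA-CGA"]
train_lineages = [
    ("Bacteria", "Firmicutes"),
    ("Bacteria", "Firmicutes"),
    ("Bacteria", "Proteobacteria"),
    ("Bacteria", "Proteobacteria"),
    ("Archaea", "Euryarchaeota"),
    ("Archaea", "Euryarchaeota"),
]
nodes = train_hierarchy(train_seqs, train_lineages)
print(classify(nodes, "ACGTTCGG"))
```

Each node only has to separate the sister clades below it, so the root classifier in this sketch distinguishes the two domains while the classifier under `Bacteria` separates the two phyla. A real system would replace the centroid rule with trained classifiers and would need a way to stop at a higher rank when a prediction is uncertain.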

For further information, visit GitHub or the Zeller team website at EMBL Heidelberg.