Scalable and Highly Accurate Methods for Metagenomics

This project is supported by the National Science Foundation (NSF), through grant number 1513629. This is a collaborative grant with the University of Maryland at College Park (PI: Mihai Pop).

Dates: September 1, 2015 to August 31, 2020

The project reports document can be found at this NSF webpage

Personnel

UIUC

PI: Tandy Warnow, Professor of Computer Science and of Bioengineering
Co-PI: Bill Gropp, Professor of Computer Science and Interim Director of the National Center for Supercomputing Applications (NCSA)
Erin Molloy, PhD student in Computer Science
Nam-phuong Nguyen, postdoctoral researcher (now at UCSD)
Michael Nute, PhD student in Statistics

University of Maryland

PI: Mihai Pop, Professor of Computer Science and Interim Director of UMIACS
Jeremy Selengut, Associate Research Scientist at UMIACS
Todd Treangen, Assistant Professor of Computer Science, Rice University
Nidhi Shah, PhD student at Maryland (advised by Mihai Pop)

Project Summary

Metagenomic studies of microbial communities can generate millions to billions of sequencing reads. The assignment of accurate taxonomic labels to these sequences is a critical component in many analyses, but is complicated by the fact that the majority of the organisms found in environmental or host-associated communities cannot be easily cultured in a laboratory. Even among the organisms that can be cultured, relatively few have been sequenced, even partially. Thus, many commonly encountered organisms are largely absent from existing databases of known genomes and genes. Providing taxonomic labels to metagenomic sequences therefore requires extrapolating the knowledge contained in sequence databases to previously unseen DNA strings. Simple similarity-based approaches (e.g., picking the best database hit as the best guess at the taxonomic label) have been shown to be insufficiently accurate, leading to the development of more sophisticated methods. Further developments are necessary to handle the characteristics of emerging sequencing technologies, such as high error rates with large numbers of insertions and deletions. To date, metagenomic taxon identification methods have been evaluated with respect to their ability to estimate the distribution of bacterial taxa (species, genera, families, etc.) within a metagenomic sample. Yet different scientific and clinical settings may require specific types of analyses, and this one type of evaluation may not be the most appropriate for all settings. For example, in a clinical setting the most important question may be whether a specific pathogen is present, while in a scientific setting the most interesting question may be whether an observed read comes from a previously unseen species. New evaluation strategies must be developed that target the specific needs of the application domain. We will address the challenges outlined above as follows.
First, we will develop a new framework for integrating the formal definition of biological use-cases with evaluation datasets and metrics in order to ensure the software being developed adequately addresses the needs of the end-users. Second, we will develop new approaches for marker-based taxon identification and abundance profiling that can leverage multiple sources of information (e.g., multiple markers) as well as handle the high error rates of third-generation sequencing technologies. These approaches will build upon our experience developing TIPP, a taxonomic profiling package that we recently published and that outperforms the leading metagenomic taxonomic profiling software, in particular for novel sequences and for longer, high-error sequences. Finally, we plan to develop high-performance computing implementations of these methods in order to enable rapid analysis of samples. Speed of analysis is particularly important in clinical settings where medical treatments may depend on the rate at which the method can return an analysis. Speed is also important in non-medical applications where faster analyses enable researchers to perform deeper or broader analyses of microbial communities. All the methods developed in the project will be made into open-source software that is freely available to the scientific public. We will provide training activities each year with funds available to students and postdocs from around the country, and an outreach program to minority-serving institutions and women's colleges. A summer REU program will also be provided at the University of Maryland, College Park.

Publications supported by the grant to UIUC

2016

N. Nguyen, T. Warnow, M. Pop, and B. White (2016). "A perspective on 16S rRNA operational taxonomic unit clustering using sequence similarity," npj Biofilms and Microbiomes, v.2, 2016. doi:10.1038/npjbiofilms.2016.4
N. Nguyen, M. Nute, S. Mirarab, and T. Warnow (2016). HIPPI: Highly accurate protein family classification with ensembles of HMMs. BMC Genomics 17 (Suppl 10):765, special issue for RECOMB-CG.
M. Nute and T. Warnow (2016). Scaling statistical multiple sequence alignment to large datasets. BMC Genomics 17 (Suppl 10):764, special issue for RECOMB-CG.
T. Hansen, S. Mollerup, N. Nguyen, L. Vinner, N. White, M. Coghlan, D. Alquezar-Planas, T. Joshi, R. Jensen, H. Fridholm, K. Kjaransdottir, T. Mourier, T. Warnow, G. Belsham, T. Gilbert, L. Orlando, M. Bunce, E. Willerslev, L. Nielsen, and A. Hansen (2016). High diversity of picornaviruses in rats from different continents revealed by deep sequencing. Emerging Microbes & Infections 5, e90, doi:10.1038/emi.2016.90.

2017

B.M. Boyd, J.M. Allen, N. Nguyen, A.D. Sweet, T. Warnow, M.D. Shapiro, S.M. Villa, S.E. Bush, D.H. Clayton, and K.P. Johnson (2017). Phylogenomics using Target-restricted Assembly Resolves Intra-generic Relationships of Parasitic Lice (Phthiraptera: Columbicola). Systematic Biology 2017, doi: 10.1093/sysbio/syx027.

2019

N. Shah, M. Nute, T. Warnow, and M. Pop. Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows. Bioinformatics vol. 35, issue 9, 2019, pp. 1613-1614, https://doi.org/10.1093/bioinformatics/bty833
Katherine R Amato, Jon G Sanders, Se Jin Song, Michael Nute, Jessica L Metcalf, Luke R Thompson, James T Morton, Amnon Amir, Valerie J McKenzie, Gregory Humphrey, Grant Gogul, James Gaffney, Andrea L Baden, Gillian AO Britton, Frank P Cuozzo, Anthony Di Fiore, Nathaniel J Dominy, Tony L Goldberg, Andres Gomez, Martin M Kowalewski, Rebecca J Lewis, Andres Link, Michelle L Sauther, Stacey Tecot, Bryan A White, Karen E Nelson, Rebecca M Stumpf, Rob Knight, and Steven R Leigh (2019). Evolutionary trends in host physiology outweigh dietary niche in structuring primate gut microbiomes. The ISME Journal (2019), 13:576-587, doi: 10.1038/s41396-018-0175-0.
S. Pattabiraman and T. Warnow (2019). Profile Hidden Markov Models are not Identifiable. IEEE/ACM Transactions on Computational Biology and Bioinformatics, in press, DOI: 10.1109/TCBB.2019.2933821.

2020

S. Christensen, E.K. Molloy, P. Vachaspati, A. Yamanuru, and T. Warnow (2020). Non-parametric correction of estimated gene trees using TRACTION. Algorithms Mol Biol. 15 (1), DOI: 10.1186/s13015-019-0161-8
E.K. Molloy and T. Warnow (2020). FastMulRFS: fast and accurate species tree estimation under generic gene duplication and loss models. Bioinformatics. 36 i57. DOI: 10.1093/bioinformatics/btaa444
B. Legried, E.K. Molloy, T. Warnow, and S. Roch. (2020). Polynomial-Time Statistical Estimation of Species Trees Under Gene Duplication and Loss. Research in Computational Molecular Biology (RECOMB 2020), Lecture Notes in Computer Science. 12074 120. DOI: 10.1007/978-3-030-45257-5_8
E.K. Molloy and T. Warnow (2019). Statistically consistent divide-and-conquer pipelines for phylogeny estimation using NJMerge. Algorithms Mol Biol. 14 14. DOI: 10.1186/s13015-019-0151-x
N. Shah, E.K. Molloy, M. Pop, and T. Warnow (2020). TIPP2: metagenomic taxonomic profiling using phylogenetic markers. Submitted
V. Smirnov and T. Warnow (2020). Unblended Disjoint Tree Merging using GTM improves species tree estimation. BMC Genomics. 21 235. DOI: 10.1186/s12864-020-6605-1
V. Smirnov and T. Warnow (2020). Phylogeny Estimation Given Sequence Length Heterogeneity. Systematic Biology. DOI: 10.1093/sysbio/syaa058
X. Yu, T. Le, S.A. Christensen, E.K. Molloy, and T. Warnow (2020). Advancing Divide-and-Conquer Phylogeny Estimation using Robinson-Foulds Supertrees. 20th International Workshop on Algorithms in Bioinformatics (WABI 2020). Editors: Carl Kingsford and Nadia Pisanti. Leibniz International Proceedings in Informatics (LIPIcs). DOI: 10.4230/LIPIcs.WABI.2020.0

In preparation

M. Nute, K. Yarlagadda, and R. Stumpf (2019). PICAN-PI: A Graphical Schema to Visualize Microbial Biodiversity.

Project Software

TIPP: taxonomic identification using phylogeny-aware profiles. TIPP is available at the github page for SEPP, which also includes code for SEPP (phylogenetic placement), UPP (ultra-large alignment using ensembles of HMMs), and HIPPI (protein family identification for protein sequences), all methods that exploit the Ensemble of HMM technique developed initially for SEPP.
HIPPI: gene binning for protein sequences, using ensembles of HMMs. (This is available on github at the website for SEPP; see above.)
PICAN-PI, open-source software available at github, is a visualization tool for use in microbiome analysis with phylogenetic placement approaches. This is the work of Mike Nute (PhD student).

Conferences and Software Schools

August 28 - September 2, 2016. Next Generation Sequencing - Algorithms, and Software For Biomedical Applications. Mihai Pop and Tandy Warnow were two of the co-organizers of this Dagstuhl seminar.
June 5-6, 2017. Advancing Genomic Biology through Novel Method Development. Tandy Warnow was the organizer and Mihai Pop was one of the speakers.
July 30-Aug 9, 2017. Woods Hole, MA. Strategies and Techniques for Analyzing Microbial Population Structure (STAMPS). Mihai Pop and Tandy Warnow lectured at this summer school on methods for microbiome analysis, and Erin Molloy taught a workshop on how to use TIPP for amplicon sequence analysis.
July 30-Aug 9, 2018. Woods Hole, MA. Strategies and Techniques for Analyzing Microbial Population Structure (STAMPS). Mihai Pop and Tandy Warnow lectured at this summer school on methods for microbiome analysis, and Mike Nute taught a workshop on how to use TIPP and SEPP for both amplicon and metagenomic datasets.
July 22-Aug 1, 2019. STAMPS course at Woods Hole. Mihai Pop and Tandy Warnow lectured at this summer school on methods for microbiome analysis, and Mike Nute taught a workshop on TIPP and SEPP. Tandy's lectures: (PDF) (PPT)

Presentations

November 9, 2015. UCSD Distinguished Lecture, Department of Computer Science. (PPT) (PDF)
January 4, 2016. Pacific Symposium on Biocomputing, Special Session on Microbiome Analysis. (PPT) (PDF)
January 12, 2016. UC San Francisco. (PDF)
August 28-September 2, 2016. Using Ensembles of HMMs for Grand Challenges in Bioinformatics, as part of the Schloss Dagstuhl seminar "Next Generation Sequencing - Algorithms, and Software For Biomedical Applications."
October 11-14, 2016. RECOMB-CG, Montreal. Scaling statistical multiple sequence alignment to large datasets. (PDF) (PPTX)
October 17, 2016. Georgia Tech, CSE Department Distinguished Lecture. Genome-scale estimation of the Tree of Life. (PDF) (PPT)
November 2, 2016. Mid-Atlantic Microbiome Meeting (M3), at the University of Maryland. (PPT) (PDF)
November 17, 2016. NIH (PPT) (PDF)
February 16-17, 2017. Second Workshop on Statistical and Algorithmic Challenges in Microbiome Data Analysis at The Broad Institute of MIT and Harvard, in Cambridge, MA. (PDF)
April 5-6, 2017. NeLLi: From New Lineages of Life to New Functions at the DOE Joint Genome Institute (JGI). (PDF)
March 16, 2018. Bioinformatics seminar, UCLA, Using Ensembles of Hidden Markov Models for Grand Challenges in Bioinformatics (PDF)
August 1, 2018. Marine Biological Laboratory at Woods Hole, STAMPS program. TIPP and SEPP: Metagenomic analysis using phylogeny-aware profiles (PDF)
February 13, 2019. University of Pennsylvania, Penn Bioinformatics Forum. Title: Using Ensembles of Hidden Markov Models for Grand Challenges in Bioinformatics (PPTX) (PDF)
April 30, 2019. European Bioinformatics Institute, Cambridge UK (PPTX) (PDF)
July 26, 2019. STAMPS course at Woods Hole. Title: TIPP, SEPP, and PASTA. (PDF) (PPT)

For the full list of talks, see this page.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.