New methods for multiple sequence alignment with improved accuracy and scalability

Participants:

Tandy Warnow (PI)
Postdoctoral researchers:
- Nam-phuong Nguyen (2014-2016)
- Paul Zaharias (2020-2021)
Graduate students:
- Mike Nute, PhD student in Statistics (PhD received August 2019)
- Elizabeth Koning (rotation student)
- Minhyuk Park
- Srilakshmi Pattabiraman (rotation student)
- Ehsan Saleh (rotation student)
- Vladimir Smirnov (PhD received August 2021)
Undergraduate students (*: supported by REU funds):
- Nikhil Agarwal, Nicholas Chen (*), Binghui Cheng, Aditi Ghosalkar, Martin Grosshauser, Maya Gupta (*), Emma Hamel, Dylan Irlbeck (*), Thien Le, Shilpa Subramanyam, Kodi Taraska (née Collins) (*), Ananya Yamanuru (*), Qikai Yang, Zitao Zhu.

Funding: U.S. National Science Foundation grant 1458652, 2014-2021. (ABI Innovation).

Project Overview:

Multiple sequence alignment (MSA) and phylogeny estimation are two very basic bioinformatics problems, which sit at the intersection of machine learning, statistical estimation, and evolutionary and structural biology. MSA has particular importance in constructing evolutionary trees, understanding the function and structure of proteins, detecting interactions between proteins, and even genome assembly. Large-scale MSA and phylogeny estimation also require high performance computing and parallel algorithms, in order to provide adequate scalability. The team will develop new machine learning techniques to greatly improve MSA methods, and hence also phylogeny estimation, since it depends on accurate multiple sequence alignments. The core of this project is algorithm development, utilizing a variety of machine learning techniques (including Hidden Markov Models), statistical estimation methods (especially Bayesian MCMC and maximum likelihood), and novel algorithmic strategies, all focused on improving scalability and accuracy.

This project (2014-2021) developed new MSA methods with greatly improved accuracy and scalability to large datasets, developed basic algorithms for related problems (e.g., protein classification), and evaluated the impact of alignments on biological discovery. Overall, this grant supported 23 publications. One research highlight is the new method MAGUS (Smirnov and Warnow, Bioinformatics 2021), which produces more accurate alignments on large datasets than previous methods and can analyze very large datasets (tens of thousands of sequences). Another significant methodological advance is HIPPI (Nguyen et al., BMC Genomics 2016), a new method that performs protein sequence classification with higher accuracy than standard methods (e.g., BLAST). We also explored techniques (Nute and Warnow, BMC Genomics 2016) that enable the popular but very computationally intensive Bayesian method BAli-Phy (Redelings and Suchard, Systematic Biology 2005) to be used on large datasets. A surprising discovery (Nute et al., Systematic Biology 2019) is that while BAli-Phy has outstanding accuracy on simulated datasets, it is not among the most accurate methods on protein benchmark datasets; this trend raises questions about the disconnect between biological benchmarks and the model-based approach in BAli-Phy.

The broader impacts of this project include five free symposia and software schools held at national scientific meetings, where students, postdocs, and faculty were taught new methods and software developed by this project and by others in the research community. The research from this project was also taught in both undergraduate and graduate courses at the University of Illinois. Finally, two graduate students completed their PhDs with support from this grant, and several others were trained by this project. Fourteen undergraduates were trained by the project, five of them supported by REU funds.

Publications supported by this grant:

N. Nguyen, M. Nute, S. Mirarab, and T. Warnow (2016). HIPPI: Highly accurate protein family classification with ensembles of HMMs. BMC Genomics 17 (Suppl 10):765, special issue for RECOMB-CG. https://doi.org/10.1186/s12864-016-3097-0
M. Nute and T. Warnow (2016). Scaling statistical multiple sequence alignment to large datasets. BMC Genomics 17 (Suppl 10):764, special issue for RECOMB-CG. https://doi.org/10.1186/s12864-016-3101-8
J. Allen, B. Boyd, N-P. Nguyen, P. Vachaspati, T. Warnow, et al. (2017). Phylogenomics from whole genome sequences using aTRAM. Systematic Biology 66 (5), 786--798.
B. Boyd, J.M. Allen, N. Nguyen, P. Vachaspati, Z.S. Quicksall, T. Warnow, L. Mugisha, K.P. Johnson, and D.L. Reed (2017). Primates, Lice and Bacteria: Speciation and Genome Evolution in the Symbionts of Hominid Lice. Molecular Biology and Evolution 2017; 34 (7): 1743-1757. doi: 10.1093/molbev/msx117
B.M. Boyd, J.M. Allen, N. Nguyen, A.D. Sweet, T. Warnow, M.D. Shapiro, S.M. Villa, S.E. Bush, D.H. Clayton, and K.P. Johnson (2017). Phylogenomics using Target-restricted Assembly Resolves Intra-generic Relationships of Parasitic Lice (Phthiraptera: Columbicola). Systematic Biology 2017, doi: 10.1093/sysbio/syx027
K. Collins and T. Warnow (2018). PASTA for Proteins. Bioinformatics, Volume 34, Issue 22, 15 November 2018, Pages 3939-3941, https://doi.org/10.1093/bioinformatics/bty495
S. Pattabiraman and T. Warnow (2018). Are Profile Hidden Markov Models Identifiable? Proceedings of ACM-BCB 2018. DOI: 10.1145/3233547.3233563.
W. Eiserhardt, A. Antonelli, D.J. Bennett, L.R. Botigue, J.G. Burleigh, S. Dodsworth, B.J. Enquist, F. Forest, J.T. Kim, A.M. Kozlov, I.J. Leitch, B.S. Maitner, S. Mirarab, W.H. Piel, O.A. Perez-Escobar, L. Pokorny, C. Rahbek, B. Sandel, S.A. Smith, A. Stamatakis, R.A. Vos, T. Warnow, and W.J. Baker (2018). A roadmap for global synthesis of the plant tree of life. American Journal of Botany, 105(3): 614-622, DOI: 10.1002/ajb2.1041, 2018
K.P. Johnson, N. Nguyen, A.D. Sweet, B.M. Boyd, T. Warnow, and J.M. Allen (2018). Simultaneous radiation of bird and mammal lice following the K-Pg boundary. Biology Letters, Vol. 14, page 20180141, DOI: 10.1098/rsbl.2018.0141.
M. Nute, E. Saleh, and T. Warnow (2019). Evaluating statistical multiple sequence alignment in comparison to other alignment methods on protein data sets. Systematic Biology 2019, Volume 68, Issue 3, May 2019, Pages 396-411, https://doi.org/10.1093/sysbio/syy068
S. Pattabiraman and T. Warnow (2019). Profile Hidden Markov Models are not Identifiable. IEEE/ACM Transactions on Computational Biology, in press, DOI 10.1109/TCBB.2019.2933821.
S. Christensen, E. Molloy, P. Vachaspati, A. Yammanuru, and T. Warnow (2020). Non-parametric correction of estimated gene trees using TRACTION. Algorithms for Molecular Biology. 15 (1), 1--18.
V. Smirnov and T. Warnow (2020). MAGUS: Multiple Sequence Alignment using Graph Clustering. Bioinformatics, btaa992, https://doi.org/10.1093/bioinformatics/btaa992
V. Smirnov and T. Warnow (2021). Phylogeny Estimation Given Sequence Length Heterogeneity. Systematic Biology, Volume 70, Issue 2, March 2021, Pages 268--282, https://doi.org/10.1093/sysbio/syaa058
P. Zaharias, V. Smirnov, and T. Warnow (2021). The Maximum Weight Trace Alignment Merging Problem. In: Martin-Vide C., Vega-Rodriguez M.A., Wheeler T. (eds) Algorithms for Computational Biology. AlCoB 2021. Lecture Notes in Computer Science, vol 12715. Springer, Cham. DOI: https://doi.org/10.1007/978-3-030-74432-8_12
E. Koning, M. Phillips, and T. Warnow (2021). pplacerDC: a New Scalable Phylogenetic Placement Method. Proceedings, 12th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (BCB ’21).
M. Gupta, P. Zaharias, and T. Warnow. (2021). Accurate Large-scale Phylogeny-Aware Alignment using BAli-Phy. Bioinformatics.
P. Zaharias, M. Grosshauser, and T. Warnow (2021). Re-evaluating Deep Neural Networks for Phylogeny Estimation: The issue of taxon sampling. Proceedings RECOMB 2021.
M. Park, P. Zaharias, and T. Warnow (2021). Disjoint Tree Mergers for Large-Scale Maximum Likelihood Tree Estimation. Algorithms. 14 (5), 148.
A. Rhie et al. (2021). Towards complete and error-free genome assemblies of all vertebrate species. Nature. 592 (7856), 737--746.
E. Wedell, Y. Cai, and T. Warnow (2021). Scalable and Accurate Phylogenetic Placement Using pplacer-XR. International Conference on Algorithms for Computational Biology.
X. Yu, T. Le, S. Christensen, E. Molloy, and T. Warnow (2021). Using Robinson-Foulds supertrees in divide-and-conquer phylogeny estimation. Algorithms for Molecular Biology. 16 (1), 1--18.
P. Zaharias, V. Smirnov, and T. Warnow (2021). The Maximum Weight Trace Alignment Merging Problem. International Conference on Algorithms for Computational Biology 2021.

Project Software:

PASTA is an improved version of SATé (Liu et al. Science 2009), which co-estimates sequence alignments and trees, and can analyze datasets with up to 1,000,000 sequences. PASTA was developed by two former PhD students of mine (Siavash Mirarab and Nam-phuong Nguyen) and has contributions now from Mike Nute (former PhD student) and Kodi Taraska (née Collins) (former REU student). See this github site.
PASTA+BAli-Phy is available at this github site, and is the work of Mike Nute (see Nute and Warnow, BMC Genomics 2016).
UPP, a new technique for multiple sequence alignment that can analyze datasets with up to 1,000,000 sequences and is highly robust to fragmentary sequences. UPP was developed by Nam Nguyen and Siavash Mirarab (current and former students of mine).
HIPPI: gene binning for protein sequences, using ensembles of HMMs. (This is available on github at the website for UPP, see above)
MAGUS: multiple sequence alignment using graph clustering (Github). MAGUS is an improvement on PASTA, in that it uses a new technique (the Graph Clustering Merger) to merge a set of disjoint alignments together. As shown in its paper (Bioinformatics 2020), it has better accuracy than PASTA and is faster. It is also related to the maximum weight trace problem, originally formulated by John Kececioglu.

Symposia and Software Schools: The grant will provide symposia and software schools to train researchers (from students through faculty) in new multiple sequence alignment methods, and other topics within phylogenomics.

2015: 2015 Phylogenomics Symposium and Software School, May 18-19, 2015, at the University of Michigan in Ann Arbor, MI, as part of the Standalone Meeting of the Society for Systematic Biologists.
2016: 2016 Phylogenomics Symposium and Software School, June 16-17, 2016, in Austin, Texas, co-located with the Evolution 2016 meeting.
2017: Advancing Genomic Biology through Novel Method Development, June 5-6, 2017, at the Radcliffe Institute for Advanced Study. This Exploratory Seminar was designed to discuss three computational problems (phylogenomics, metagenomics, and protein sequence analysis) where novel methods are needed to advance discovery in the presence of large datasets; multiple sequence alignment is key to each of the three problems that were addressed.
2018: 2018 Phylogenomics Software Symposium, Institut des Sciences de l'Evolution - Montpellier (ISEM), at the University of Montpellier, August 17, 2018.
2020: 2020 Phylogenomics Software Symposium at the SSB Standalone Meeting, Systematics in the Swamp.

Presentations: See http://tandy.cs.illinois.edu/talks.html for the full list of talks.

January 12, 2016. UC San Francisco (Patsy Babbitt) (PDF)
June 16, 2016. Phylogenomics Symposium, Advances in Multiple Sequence Alignment, in Austin, TX (part of Evolution 2016) (PDF)
October 11, 2016. RECOMB-CG, Montreal. Scaling statistical multiple sequence alignment to large datasets (PDF)
May 30, 2017. Keynote talk at IPDPS (IEEE International Parallel and Distributed Processing Symposium), Orlando FL. (PDF)
August 17, 2018. Scaling statistical multiple sequence alignment methods, Phylogenomics Symposium, Montpellier France. (PDF)
February 13, 2018. Ensembles of Hidden Markov Models, Penn Bioinformatics Forum, University of Pennsylvania.
February 13, 2019. University of Pennsylvania, Penn Bioinformatics Forum. Title: Using Ensembles of Hidden Markov Models for Grand Challenges in Bioinformatics
April 12-14, 2019. "Trees in the Desert 2019", a workshop on ultra-large phylogenetic trees, Tucson AZ; I talked about large-scale multiple sequence alignment methods, including PASTA and UPP
April 30, 2019. European Bioinformatics Institute, Cambridge UK. I talked about ultra-large multiple sequence alignment, including PASTA and UPP, and the use of ensembles of HMMs for various applications
June 22, 2021. SBE (Systematics, Biogeography, and Evolution). I will talk about multiple sequence alignment in the symposium on Principles, philosophy, and methodology of phylogenetic systematics.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.