New methods for multiple sequence alignment with improved accuracy and scalability

Participants:

Funding: U.S. National Science Foundation grant 1458652, 2014-2021 (ABI Innovation).

Project Overview:

Multiple sequence alignment (MSA) and phylogeny estimation are two fundamental bioinformatics problems that sit at the intersection of machine learning, statistical estimation, and evolutionary and structural biology. MSA has particular importance in constructing evolutionary trees, understanding the function and structure of proteins, detecting interactions between proteins, and even genome assembly. Large-scale MSA and phylogeny estimation also require high performance computing and parallel algorithms to scale adequately. The team will develop new machine learning techniques to greatly improve MSA methods, and hence phylogeny estimation, which depends on accurate multiple sequence alignments. The core of this project is algorithm development, utilizing a variety of machine learning techniques (including Hidden Markov Models), statistical estimation methods (especially Bayesian MCMC and maximum likelihood), and novel algorithmic strategies, all focused on improving scalability and accuracy.

This project (2014-2021) developed new MSA methods with greatly improved accuracy and scalability to large datasets, developed basic algorithms for related problems (e.g., protein classification), and evaluated the impact of alignments on biological discovery. Overall, 23 publications were supported by this grant. One of the research highlights of this project is the new method MAGUS (Smirnov and Warnow, Bioinformatics 2021), which produces more accurate alignments on large datasets than previous methods and is able to analyze very large datasets (tens of thousands of sequences). Another significant methodological advance is HIPPI (Nguyen et al., BMC Genomics 2016), a new method that performs protein sequence classification with higher accuracy than standard methods (e.g., BLAST). We also explored techniques (Nute and Warnow, BMC Genomics 2016) that enable the popular but very computationally intensive Bayesian method BAli-Phy (Redelings and Suchard, Systematic Biology 2005) to be used on large datasets. A surprising discovery (Nute et al., Systematic Biology 2019) is that while BAli-Phy has outstanding accuracy on simulated datasets, it is not among the most accurate methods on protein benchmark datasets; this trend raises questions about the disconnect between biological benchmarks and the model-based approach in BAli-Phy.

The broader impacts of this project include five free symposia and software schools held at national scientific meetings, where students, postdocs, and faculty were taught new methods and software developed by this project and by others in the research community. The research from this project was also taught in both undergraduate and graduate courses at the University of Illinois. Finally, two graduate students completed their PhDs supported by this grant, and several others were trained by this project. Fourteen undergraduates were also trained by the project, five of them supported by REU funds.

Publications supported by this grant:

  1. N. Nguyen, M. Nute, S. Mirarab, and T. Warnow (2016). HIPPI: Highly accurate protein family classification with ensembles of HMMs. BMC Genomics 17 (Suppl 10):765, special issue for RECOMB-CG. https://doi.org/10.1186/s12864-016-3097-0
  2. M. Nute and T. Warnow (2016). Scaling statistical multiple sequence alignment to large datasets. BMC Genomics 17 (Suppl 10):764, special issue for RECOMB-CG. https://doi.org/10.1186/s12864-016-3101-8
  3. J. Allen, B. Boyd, N-P. Nguyen, P. Vachaspati, T. Warnow, et al. (2017). Phylogenomics from whole genome sequences using aTRAM. Systematic Biology 66 (5), 786--798.
  4. B. Boyd, J.M. Allen, N. Nguyen, P. Vachaspati, Z.S. Quicksall, T. Warnow, L. Mugisha, K.P. Johnson, and D.L. Reed (2017). Primates, Lice and Bacteria: Speciation and Genome Evolution in the Symbionts of Hominid Lice. Molecular Biology and Evolution 34 (7): 1743-1757. doi: 10.1093/molbev/msx117
  5. B.M. Boyd, J.M. Allen, N. Nguyen, A.D. Sweet, T. Warnow, M.D. Shapiro, S.M. Villa, S.E. Bush, D.H. Clayton, and K.P. Johnson (2017). Phylogenomics using Target-restricted Assembly Resolves Intra-generic Relationships of Parasitic Lice (Phthiraptera: Columbicola). Systematic Biology 2017, doi: 10.1093/sysbio/syx027
  6. K. Collins and T. Warnow (2018). PASTA for Proteins. Bioinformatics, Volume 34, Issue 22, 15 November 2018, Pages 3939-3941, https://doi.org/10.1093/bioinformatics/bty495
  7. S. Pattabiraman and T. Warnow (2018). Are Profile Hidden Markov Models Identifiable? Proceedings of ACM-BCB 2018. DOI: 10.1145/3233547.3233563.
  8. W. Eiserhardt, A. Antonelli, D.J. Bennett, L.R. Botigue, J.G. Burleigh, S. Dodsworth, B.J. Enquist, F. Forest, J.T. Kim, A.M. Kozlov, I.J. Leitch, B.S. Maitner, S. Mirarab, W.H. Piel, O.A. Perez-Escobar, L. Pokorny, C. Rahbek, B. Sandel, S.A. Smith, A. Stamatakis, R.A. Vos, T. Warnow, and W.J. Baker (2018). A roadmap for global synthesis of the plant tree of life. American Journal of Botany, 105(3): 614-622, DOI: 10.1002/ajb2.1041
  9. K.P. Johnson, N. Nguyen, A.D. Sweet, B.M. Boyd, T. Warnow, and J.M. Allen (2018). Simultaneous radiation of bird and mammal lice following the K-Pg boundary. Biology Letters, Vol. 14, page 20180141, DOI: 10.1098/rsbl.2018.0141.
  10. M. Nute, E. Saleh, and T. Warnow (2019). Evaluating statistical multiple sequence alignment in comparison to other alignment methods on protein data sets. Systematic Biology, Volume 68, Issue 3, May 2019, Pages 396-411, https://doi.org/10.1093/sysbio/syy068
  11. S. Pattabiraman and T. Warnow (2019). Profile Hidden Markov Models are not Identifiable. IEEE/ACM Transactions on Computational Biology, in press, DOI 10.1109/TCBB.2019.2933821.
  12. S. Christensen, E. Molloy, P. Vachaspati, A. Yammanuru, and T. Warnow (2020). Non-parametric correction of estimated gene trees using TRACTION. Algorithms for Molecular Biology. 15 (1), 1--18.
  13. V. Smirnov and T. Warnow (2020). MAGUS: Multiple Sequence Alignment using Graph Clustering. Bioinformatics, btaa992, https://doi.org/10.1093/bioinformatics/btaa992
  14. V. Smirnov and T. Warnow (2021). Phylogeny Estimation Given Sequence Length Heterogeneity. Systematic Biology, Volume 70, Issue 2, March 2021, Pages 268--282, https://doi.org/10.1093/sysbio/syaa058
  15. P. Zaharias, V. Smirnov, and T. Warnow (2021). The Maximum Weight Trace Alignment Merging Problem. In: Martin-Vide C., Vega-Rodriguez M.A., Wheeler T. (eds) Algorithms for Computational Biology. AlCoB 2021. Lecture Notes in Computer Science, vol 12715. Springer, Cham. DOI: https://doi.org/10.1007/978-3-030-74432-8_12
  16. E. Koning, M. Phillips, and T. Warnow (2021). pplacerDC: a New Scalable Phylogenetic Placement Method. Proceedings, 12th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (BCB ’21).
  17. M. Gupta, P. Zaharias, and T. Warnow (2021). Accurate Large-scale Phylogeny-Aware Alignment using BAli-Phy. Bioinformatics.
  18. P. Zaharias, M. Grosshauser, and T. Warnow (2021). Re-evaluating Deep Neural Networks for Phylogeny Estimation: The issue of taxon sampling. Proceedings RECOMB 2021.
  19. M. Park, P. Zaharias, and T. Warnow (2021). Disjoint Tree Mergers for Large-Scale Maximum Likelihood Tree Estimation. Algorithms. 14 (5), 148.
  20. A. Rhie et al. (2021). Towards complete and error-free genome assemblies of all vertebrate species. Nature. 592 (7856), 737--746.
  21. E. Wedell, Y. Cai, and T. Warnow (2021). Scalable and Accurate Phylogenetic Placement Using pplacer-XR. International Conference on Algorithms for Computational Biology.
  22. X. Yu, T. Le, S. Christensen, E. Molloy, and T. Warnow (2021). Using Robinson-Foulds supertrees in divide-and-conquer phylogeny estimation. Algorithms for Molecular Biology. 16 (1), 1--18.
  23. P. Zaharias, V. Smirnov, and T. Warnow (2021). The Maximum Weight Trace Alignment Merging Problem. International Conference on Algorithms for Computational Biology 2021.

Project Software:

Symposia and Software Schools: The grant will provide symposia and software schools to train researchers (from students through faculty) in new multiple sequence alignment methods, and other topics within phylogenomics.

Presentations: See http://tandy.cs.illinois.edu/talks.html for the full list of talks.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.