New methods for multiple sequence alignment with improved accuracy and scalability


Funding: U.S. National Science Foundation grant 1458652 (ABI Innovation).

Project Overview: Multiple sequence alignment (MSA) is one of the most basic bioinformatics steps, in which a set of molecular sequences (i.e., DNA, RNA, or amino acid sequences) is arranged in a matrix so that corresponding positions fall in the same column. MSA calculation is the first step in many biological analyses. Because of its broad applicability and importance, many MSA methods have been developed and are in wide use today. Unfortunately, many real-world biological datasets have features (large size and fragmentary sequences, for example) that make accurate MSA calculation very difficult. Because poorly estimated alignments result in errors in downstream biological analyses, new MSA techniques are needed that can produce accurate alignments on difficult datasets. This project will develop MSA methods that have greatly improved accuracy and that can analyze the large and heterogeneous sequence datasets being assembled in different biology projects nationally. The project also has a substantial outreach component to women's colleges and minority serving institutions, and summer software schools to train biologists in the use of the project software.

Multiple sequence alignment (MSA) and phylogeny estimation are two very basic bioinformatics problems, which sit at the intersection of machine learning, statistical estimation, and evolutionary and structural biology. MSA has particular importance in constructing evolutionary trees, understanding the function and structure of proteins, detecting interactions between proteins, and even genome assembly. Large-scale MSA and phylogeny estimation also require high performance computing and parallel algorithms to provide adequate scalability. The team will develop new machine learning techniques to greatly improve MSA methods, and hence also phylogeny estimation, which depends on accurate multiple sequence alignments. The core of this project is algorithm development, utilizing a variety of machine learning techniques (including Hidden Markov Models), statistical estimation methods (especially Bayesian MCMC and maximum likelihood), and novel algorithmic strategies, all focused on improving scalability and accuracy.
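
As a small illustration of the matrix view of an alignment described in the overview, the sketch below stores a toy alignment as equal-length rows padded with gap characters and reads corresponding positions off the columns. The sequences and the helper names (columns, unaligned, fully_conserved) are hypothetical and chosen only for illustration; this is not code from the project software.

```python
# Toy alignment (hypothetical data): each row is one sequence padded with
# '-' gap characters so that all rows have the same length.
alignment = {
    "seq1": "ACG-TTA",
    "seq2": "ACGCTTA",
    "seq3": "A-G-TTA",
}


def columns(aln):
    """Return the alignment columns as strings, one per matrix column."""
    rows = list(aln.values())
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    return ["".join(col) for col in zip(*rows)]


def unaligned(aln):
    """Strip gap characters to recover the original input sequences."""
    return {name: row.replace("-", "") for name, row in aln.items()}


def fully_conserved(aln):
    """Indices of gap-free columns in which every sequence has the same character."""
    return [
        i
        for i, col in enumerate(columns(aln))
        if "-" not in col and len(set(col)) == 1
    ]


if __name__ == "__main__":
    print(columns(alignment))
    print(unaligned(alignment))
    print(fully_conserved(alignment))
```

Removing the gaps from each row gives back the input sequences, which is what distinguishes an alignment of a dataset from an arbitrary matrix of characters.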

Publications supported by this grant:

  1. N. Nguyen, M. Nute, S. Mirarab, and T. Warnow (2016). HIPPI: Highly accurate protein family classification with ensembles of HMMs. BMC Genomics 17 (Suppl 10):765, special issue for RECOMB-CG.
  2. M. Nute and T. Warnow (2016). Scaling statistical multiple sequence alignment to large datasets. BMC Genomics 17 (Suppl 10):764, special issue for RECOMB-CG.
  3. B. Boyd, J.M. Allen, N. Nguyen, P. Vachaspati, Z.S. Quicksall, T. Warnow, L. Mugisha, K.P. Johnson, and D.L. Reed (2017). Primates, Lice and Bacteria: Speciation and Genome Evolution in the Symbionts of Hominid Lice. Molecular Biology and Evolution 34(7):1743-1757, DOI: 10.1093/molbev/msx117.
  4. B.M. Boyd, J.M. Allen, N. Nguyen, A.D. Sweet, T. Warnow, M.D. Shapiro, S.M. Villa, S.E. Bush, D.H. Clayton, and K.P. Johnson (2017). Phylogenomics using Target-restricted Assembly Resolves Intra-generic Relationships of Parasitic Lice (Phthiraptera: Columbicola). Systematic Biology, DOI: 10.1093/sysbio/syx027.
  5. K. Collins and T. Warnow (2018). PASTA for Proteins. Bioinformatics 34(22):3939-3941.
  6. S. Pattabiraman and T. Warnow (2018). Are Profile Hidden Markov Models Identifiable? Proceedings of ACM-BCB 2018. DOI: 10.1145/3233547.3233563.
  7. W. Eiserhardt, A. Antonelli, D.J. Bennett, L.R. Botigue, J.G. Burleigh, S. Dodsworth, B.J. Enquist, F. Forest, J.T. Kim, A.M. Kozlov, I.J. Leitch, B.S. Maitner, S. Mirarab, W.H. Piel, O.A. Perez-Escobar, L. Pokorny, C. Rahbek, B. Sandel, S.A. Smith, A. Stamatakis, R.A. Vos, T. Warnow, and W.J. Baker (2018). A roadmap for global synthesis of the plant tree of life. American Journal of Botany 105(3):614-622, DOI: 10.1002/ajb2.1041.
  8. K.P. Johnson, N. Nguyen, A.D. Sweet, B.M. Boyd, T. Warnow, and J.M. Allen (2018). Simultaneous radiation of bird and mammal lice following the K-Pg boundary. Biology Letters 14:20180141, DOI: 10.1098/rsbl.2018.0141.
  9. M. Nute, E. Saleh, and T. Warnow (2019). Evaluating statistical multiple sequence alignment in comparison to other alignment methods on protein data sets. Systematic Biology 68(3):396-411.
  10. S. Pattabiraman and T. Warnow (2019). Profile Hidden Markov Models are not Identifiable. IEEE/ACM Transactions on Computational Biology and Bioinformatics, in press, DOI: 10.1109/TCBB.2019.2933821.

Project Software:

Symposia and Software Schools: The grant will provide symposia and software schools to train researchers (from students through faculty) in new multiple sequence alignment methods, and other topics within phylogenomics.

Presentations: See for the full list of talks.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.