New methods for multiple sequence alignment with improved accuracy and scalability
- Tandy Warnow (PI)
- Former REU student in Statistics, now a PhD student at UCLA
- Mike Nute, PhD student in Statistics (PhD expected May 2019)
- Ehsan Saleh, PhD student in Computer Science (rotation student 2017-2018)
Funding: U.S. National Science Foundation grant 1458652
Multiple sequence alignment (MSA) is one of the most basic bioinformatics steps, in which a set of molecular sequences (i.e., DNA, RNA, or amino acid sequences) is arranged in a matrix to identify corresponding positions. MSA calculation is a fundamental first step in many biological analyses. Because of its broad applicability and importance, many MSA methods have been developed and are in wide use today. Unfortunately, many real-world biological datasets have features (large size and fragmentary sequences, for example) that make accurate MSA calculation very difficult. Because poorly estimated alignments result in errors in downstream biological analyses, new MSA techniques are needed that can produce accurate alignments on difficult datasets. This project will develop MSA methods that have greatly improved accuracy and can analyze the large and heterogeneous sequence datasets being assembled in different biology projects nationally. The project also has a substantial outreach component to women's colleges and minority-serving institutions, and summer software schools to train biologists in the use of the project software.
Multiple sequence alignment (MSA) and phylogeny estimation are two very basic bioinformatics problems, which sit at the intersection of machine learning, statistical estimation, and evolutionary and structural biology. MSA has particular importance in constructing evolutionary trees, understanding the function and structure of proteins, detecting interactions between proteins, and even genome assembly. Large-scale MSA and phylogeny estimation also require high performance computing and parallel algorithms, in order to provide adequate scalability. The team will develop new machine learning techniques to greatly improve MSA methods, and hence also phylogeny estimation, since it depends on accurate multiple sequence alignments. The core of this project is algorithm development, utilizing a variety of machine learning techniques (including Hidden Markov Models), statistical estimation methods (especially Bayesian MCMC and maximum likelihood), and novel algorithmic strategies, all focused on improving scalability and accuracy.
Publications supported by this grant:
N. Nguyen, M. Nute, S. Mirarab, and T. Warnow (2016). HIPPI: Highly accurate protein family classification with ensembles of HMMs. BMC Genomics 17 (Suppl 10):765, special issue for RECOMB-CG.
M. Nute and T. Warnow (2016). Scaling statistical multiple sequence alignment to large datasets. BMC Genomics 17 (Suppl 10):764, special issue for RECOMB-CG.
B. Boyd, J.M. Allen, N. Nguyen, P. Vachaspati, Z.S. Quicksall, T. Warnow, L. Mugisha, K.P. Johnson, and D.L. Reed (2017). Primates, Lice and Bacteria: Speciation and Genome Evolution in the Symbionts of Hominid Lice. Molecular Biology and Evolution 2017; 34 (7): 1743-1757. doi: 10.1093/molbev/msx117
B.M. Boyd, J.M. Allen, N. Nguyen, A.D. Sweet, T. Warnow, M.D. Shapiro, S.M. Villa, S.E. Bush, D.H. Clayton, and K.P. Johnson (2017). Phylogenomics using Target-restricted Assembly Resolves Intra-generic Relationships of Parasitic Lice (Phthiraptera: Columbicola). Systematic Biology 2017, doi: 10.1093/sysbio/syx027
K. Collins and T. Warnow (2018). PASTA for Proteins. Bioinformatics, Volume 34, Issue 22, 15 November 2018, Pages 3939-3941, https://doi.org/10.1093/bioinformatics/bty495
S. Pattabiraman and T. Warnow (2018). Are Profile Hidden Markov Models Identifiable? Proceedings of ACM-BCB 2018. DOI: 10.1145/3233547.3233563.
W. Eiserhardt, A. Antonelli, D.J. Bennett, L.R. Botigue, J.G. Burleigh, S. Dodsworth, B.J. Enquist, F. Forest, J.T. Kim, A.M. Kozlov, I.J. Leitch, B.S. Maitner, S. Mirarab, W.H. Piel, O.A. Perez-Escobar, L. Pokorny, C. Rahbek, B. Sandel, S.A. Smith, A. Stamatakis, R.A. Vos, T. Warnow, and W.J. Baker (2018). A roadmap for global synthesis of the plant tree of life. American Journal of Botany, 105(3): 614-622, DOI: 10.1002/ajb2.1041, 2018
K.P. Johnson, N. Nguyen, A.D. Sweet, B.M. Boyd, T. Warnow, and J.M. Allen (2018). Simultaneous radiation of bird and mammal lice following the K-Pg boundary. Biology Letters, Vol. 14, page 20180141, DOI: 10.1098/rsbl.2018.0141.
M. Nute, E. Saleh, and T. Warnow (2019). Evaluating statistical multiple sequence alignment in comparison to other alignment methods on protein data sets. Systematic Biology, Volume 68, Issue 3, May 2019, Pages 396-411, https://doi.org/10.1093/sysbio/syy068
- PASTA, an improved version of SATé (Liu et al. Science 2009), which co-estimates sequence alignments and trees, and can analyze datasets with up to 1,000,000 sequences. PASTA was developed by two former students of mine (Siavash Mirarab and Nam Nguyen) and now has contributions from Mike Nute (current PhD student) and a former REU student (now a PhD student at UCLA). See this github site.
- PASTA+BAli-Phy is available at this github site, and is the work of Mike Nute (see Nute and Warnow, BMC Genomics 2016).
- UPP, a new technique for
multiple sequence alignment that can analyze datasets with up to 1,000,000
sequences and is highly robust to fragmentary sequences. UPP was developed
by Nam Nguyen and Siavash Mirarab (current and former students of mine).
- HIPPI: gene binning for protein sequences, using ensembles of HMMs.
(This is available on github at the website for UPP, see above)
Summer Symposia and Software Schools:
The grant will provide summer symposia and software schools to train researchers
(from students through faculty) in new multiple
sequence alignment methods, and other topics within phylogenomics.
- Summer 2015: 2015 Phylogenomics Symposium and Software School, May 18-19, 2015, at the University of Michigan in Ann Arbor, MI, as part of the Standalone Meeting of the Society of Systematic Biologists.
- Summer 2016: 2016 Phylogenomics Symposium and Software School, June 16-17, 2016, in Austin, Texas,
co-located with the Evolution 2016 meeting.
- Summer 2017: Advancing Genomic Biology through Novel Method Development, June 5-6, 2017, at the Radcliffe Institute for Advanced Study. This Exploratory Seminar was designed to discuss three computational problems (phylogenomics, metagenomics, and protein sequence analysis) where novel methods are needed to advance discovery in the presence of large datasets; multiple sequence alignment is key to each of the three problems that were addressed.
- Summer 2018: 2018 Phylogenomics Software Symposium, Institut des Sciences de l'Evolution - Montpellier (ISEM), at the University of Montpellier, August 17, 2018.
See http://tandy.cs.illinois.edu/talks.html for the full list of talks.
- January 12, 2016. UC San Francisco (Patsy Babbitt) (PDF)
- June 16, 2016. Phylogenomics Symposium, Advances in Multiple Sequence Alignment,
in Austin, TX (part of Evolution 2016)
- October 11, 2016. RECOMB-CG, Montreal. Scaling statistical multiple sequence alignment to large datasets (PDF)
- May 30, 2017. Keynote talk at IPDPS (IEEE International Parallel and Distributed Processing Symposium), Orlando, FL.
- August 17, 2018. Scaling statistical multiple sequence alignment methods, Symposium, Montpellier, France.
- February 13, 2018. Ensembles of Hidden Markov Models, Penn Bioinformatics Forum, University of Pennsylvania.
- February 13, 2019. University of Pennsylvania, Penn Bioinformatics Forum. Title: Using Ensembles of Hidden Markov Models for Grand Challenges in Bioinformatics
- April 12-14, 2019. "Trees in the Desert 2019", a workshop on ultra-large phylogenetic trees, Tucson, AZ. I talked about large-scale multiple sequence alignment methods, including PASTA and UPP.
- April 30, 2019. European Bioinformatics Institute, Cambridge, UK. I talked about ultra-large multiple sequence alignment, including PASTA and UPP, and the use of ensembles of HMMs for various applications.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.