Graph-Theoretic Algorithms to Improve Phylogenomic Analyses
Funding: U.S. National Science Foundation grant CCF-1535977
(Algorithms in the Field), collaborative with UC Berkeley.
This grant lasted from 2014 to 2020.
The project outcomes report is available at this NSF page.
Participants at UIUC:
- Tandy Warnow (overall PI)
- Chandra Chekuri (Co-PI)
-
Sarah Christensen,
(was PhD student in Computer Science, PhD received Spring 2021)
-
Thien Le (was undergraduate student at UIUC, now PhD student
at MIT)
-
Erin Molloy
(was PhD student in Computer Science co-supervised with Bill Gropp,
PhD received Summer 2020, now Assistant Professor at Univ Maryland College Park))
-
Mike Nute
(was PhD student in Statistics, PhD received Spring 2019, now
postdoc at Rice University)
-
Pranjal Vachaspati
(was PhD student in Computer Science, PhD received Summer 2019,
now at Google in Boston)
-
Xilin Yu
(was MS student in Computer Science, MS received Spring 2019, now at
Amazon Web Services in Seattle)
Participants at Berkeley (collaborating institution):
- Satish Rao (PI)
- Aaron Sy (MS student in CS)
- Richard Zhang (PhD student in Mathematics)
Project Overview:
Understanding the history of life on earth--how species evolved from their common ancestor--is a major goal of biological research. These evolutionary trees are very hard to construct with high accuracy, because nearly all of the most accurate approaches require the solution to computationally hard optimization problems. Furthermore, research has shown that the evolutionary tree for a single gene can be different from the evolutionary tree for the species, and current methods do not provide adequate accuracy on genome-scale data. As a result, large evolutionary trees, covering big portions of "The Tree of Life", are very difficult to compute with high accuracy. This project will develop methods that can enable highly accurate species tree estimation. The key approach is the development of novel divide-and-conquer strategies, whereby a dataset is divided into overlapping subsets, species trees are constructed on the subsets, and then the subset species trees are merged together into a tree on the full dataset. These approaches will be combined with powerful statistical estimation methods, to potentially transform the capability of evolutionary biologists to analyze their data. This project will also provide open source software for the new methods that are developed, and provide training in the use of the software to biologists at national meetings. The project will also contribute to interdisciplinary training for two doctoral students, one at Illinois and one at Berkeley, and course materials for computational biology will be made available online.
Publications supported by the grant to UIUC
2016
-
P. Vachaspati and T. Warnow (2016). FastRFS: Fast and accurate Robinson-Foulds Supertrees using constrained exact optimization. Bioinformatics, special issue for RECOMB-CG, DOI: 10.1093/bioinformatics/btw600.
2017
-
P. Vachaspati and T. Warnow (2017). Enhancing searches for optimal trees using SIESTA.
In: Meidanis J., Nakhleh L. (eds) Comparative Genomics (papers from RECOMB-CG 2017),
pages 232-255, Lecture Notes in Computer Science, vol 10562. Springer, Cham
DOI: 10.1007/978-3-319-67979-2_13
-
S. Christensen, E. Molloy, P. Vachaspati, and T. Warnow (2017).
Optimal Completion of Gene Trees in Polynomial Time using OCTAL.
17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Editors: Russell Schwartz and Knut Reinert; Article No. 27; pp. 27:1-27:14, DOI: 10.4230/LIPIcs.WABI.2017.27.
2018
-
S. Christensen, E. Molloy, P. Vachaspati, and T. Warnow (2018).
OCTAL: Optimal Completion of Gene Trees in Polynomial Time. Algorithms for Molecular Biology, 13:6. https://almob.biomedcentral.com/articles/10.1186/s13015-018-0124-5.
Special issue of extended papers from WABI 2017.
DOI: 10.1186/s13015-018-0124-5.
-
P. Vachaspati and T. Warnow (2018). SIESTA: Enhancing searches for optimal supertrees
and species trees. BMC Genomics, 19(Suppl 5):252,
Special issue of extended papers from RECOMB-CG, 2017.
DOI: 10.1186/s12864-018-4621-1.
-
P. Vachaspati and T. Warnow (2018).
SVDquest: Improving SVDquartets species tree estimation using exact optimization within a constrained search space. Molecular Phylogenetics and Evolution
Vol. 124, pp. 122-136,
DOI: 10.1016/j.ympev.2018.03.006
-
M. Nute, E. Molloy, J. Chou, and T. Warnow (2018). The Performance of Coalescent-Based Species Tree Estimation Methods under Models of Missing Data. BMC Genomics, 19(Suppl 5):286,
Special issue for selected papers from RECOMB-CG, 2017.
DOI: 10.1186/s12864-018-4619-8.
-
E. Molloy and T. Warnow (2018)
To Include or Not to Include: The Impact of Gene Filtering on Species Tree Estimation Methods.
Systematic Biology, Volume 67, Issue 2, 1 March 2018, pages 285-303.
DOI: https://doi.org/10.1093/sysbio/syx077
-
Md S. Bayzid and T. Warnow (2018).
Gene tree parsimony for incomplete gene trees: addressing true
biological loss. Algorithms for Molecular Biology, 13:1, special issue for selected papers from WABI 2017.
DOI: 10.1186/s13015-017-0120-1.
-
E. Molloy and T. Warnow (2018). NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees. In: Blanchette M., Ouangraoua A. (eds) Comparative Genomics. RECOMB-CG 2018. Lecture Notes in Computer Science, vol 11183. Springer, Cham. Proceedings of RECOMB-CG (Satellite conference for Comparative Genomics).
DOI: 10.1007/978-3-030-00834-5_15
-
R. Zhang, S. Rao, and T. Warnow (2018). New absolute fast converging phylogeny estimation methods with improved scalability and accuracy. Proceedings of WABI (Workshop on Algorithms for Bioinformatics) 2018.
DOI: 10.4230/LIPIcs.WABI.2018
2019
-
S. Roch, M. Nute, and T. Warnow (2019). Long-branch attraction in species tree estimation: inconsistency of partitioned likelihood and topology-based summary methods. Systematic Biology,
Volume 68, Issue 2, 1 March 2019, Pages 281-297.
DOI: 10.1093/sysbio/syy061
-
Q. Zhang, S. Rao, and T. Warnow (2019). Constrained incremental tree building: new absolute fast converging phylogeny estimation methods with improved scalability and accuracy. Algorithms for Molecular Biology, 14(2), 2019. https://rdcu.be/blBXm.
DOI: 10.1186/s13015-019-0136-9
-
S. Christensen, E. Molloy, P. Vachaspati, and T. Warnow (2019).
TRACTION: Fast non-parametric improvement of estimated gene trees. Proceedings of WABI 2019, LIPICS volume 143, pages 4:1-4:16, 2019.
-
T. Le, A. Sy, E. Molloy, Q. Zhang, S. Rao and T. Warnow. Using INC within Divide-and- Conquer Phylogeny Estimation. Algorithms for Computational Biology, Sixth International Conference, Berkeley, CA, May 28-30, 2019; LNCS/LNBI, Chapter 12, DOI:10.1007/978-3-030-18174-1\_12.
-
E. Molloy and T. Warnow (2019). TreeMerge: A new method for improving the scalability of species tree estimation methods. Bioinformatics, special issue for Intelligent Systems for Molecular Biology (ISMB) 2019, Volume 35, Issue 14, July 2019, Pages i417--i426.
DOI: 10.1093/bioinformatics/btz344
-
V. Smirnov and T. Warnow (2019). Fast and Highly Accurate Species Trees through Divide-and-Conquer using GTM. Proceedings of RECOMB-CG 2019.
2020
-
B. Legried, E.K. Molloy, T. Warnow, and S. Roch (2020).
"Polynomial-Time Statistical Estimation of Species Trees under Gene Duplication and Loss." RECOMB 2020, pp. 120--135. Preprint available at bioRxiv:821439
DOI: 10.1007/978-3-030-45257-5_8
- S.A. Christensen, E.K. Molloy, P. Vachaspati, A. Yammanuru, and T. Warnow (2020). Non-parametric correction of estimated gene trees using TRACTION. Algorithms for Molecular Biology, 15 (1).
DOI: 10.1186/s13015-019-0161-8
- V. Smirnov and T. Warnow (2020).
"Unblended Disjoint Tree Merging using GTM improves species tree estimation." BMC Genomics (special issue of selected papers from RECOMB-CG), Volume 21 (Suppl 2): 235, https://doi.org/10.1186/s12864-020-6605-1.
DOI: 10.1186/s12864-020-6605-1.
-
T. Le, E.K. Molloy, Q. Zhang, S. Rao, and T. Warnow (2020).
"Using Constrained-INC for Large-scale Gene Tree and Species Tree Estimation." IEEE/ACM Transactions on Computational Biology and Bioinformatics. (HTML).
DOI: 10.1109/TCBB.2020.2990867
-
X. Yu, T. Le, S. Christensen, E.K. Molloy, and T. Warnow (2020).
"Advancing Divide-and-Conquer Phylogeny Estimation using Robinson-Foulds Supertrees." Proceedings, WABI 2020. Available as biorxiv 2020.05.16.099895.
-
E.K. Molloy and T. Warnow (2020).
"FastMulRFS: Fast and accurate species tree estimation under generic gene duplication and loss models." ISMB 2020 and Bioinformatics, Vol. 36, pages i57-i65,
DOI:10.1093/bioinformatics/btaa444
Project Software:
-
FastRFS, a fast supertree method.
Open source software
on Github.
FastRFS is a dynamic programming algorithm that finds an
exact solution to the Robinson-Foulds Supertree problem within
a constrained search space. This study shows that
FastRFS is very fast, and can analyze very large datasets with
thousands of taxa. In addition, FastRFS is more
accurate than major alternative supertree methods.
-
OCTAL, open source software
on Github.
OCTAL (and its improved version,
TRACTION) is a method for gene tree correction,
given a reference species tree, that is neutral to
the cause for gene tree discord.
OCTAL and TRACTION are limited to gene trees with
at most one copy of any species.
-
SIESTA, open source software
on Github.
SIESTA
is a method for computing
statistics and consensus trees for
species tree methods (such as ASTRAL,
FastRFS, and SVDquest) that operate
using dynamic programming.
-
SVDquest, open source software
on Github.
SVDquest
is a method for exactly solving the maximum
quartet compatibility problem
within a constrained search space.
-
NJMerge, open source software
on Github.
NJMerge is software for merging a set of leaf-disjoint
trees into a tree on the full dataset, using
an estimated distance matrix relating all the species.
-
INC (Incremental tree building) and Constrained INC, open source
software
on Github.
INC is software for an absolute fast converging
algorithm that constructs a tree from a distance matrix.
Constrained INC is software for merging a set of leaf-disjoint
trees into a tree on the full dataset, using
aligned sequences for the species.
Summer Symposia and Software Schools:
The grant provided summer symposia and software schools to train researchers
(from students through faculty) in phylogenomic estimation,
which is a main focus of this grant.
- Summer 2015: 2015 Phylogenomics Symposium and Software School, May 18-19, 2015, at the University of Michigan in Ann Arbor, MI, as part of the Standalone Meeting of the Society for Systematic Biologists.
- Summer 2016: 2016 Phylogenomics Symposium and Software School, June 16-17, 2016, in Austin, Texas,
co-located with the Evolution 2016 meeting.
-
Summer 2017:
Advancing Genomic Biology through Novel Method Development,
June 5-6, 2017, at the Radcliffe Institute for Advanced Study.
This Exploratory Seminar was designed to discuss
three computational problems (phylogenomics, metagenomics,
and protein sequence analysis) where novel methods are needed to
advance discovery in the presence of large datasets; multiple-sequence
alignment methods is key to each of the three problems that were addressed.
-
Summer 2018: 2018 Phylogenomics Software Symposium,
Institut des Sciences de l'Evolution - Montpellier (ISEM), at the University
of Montpellier, August 17, 2018.
Presentations:
See http://tandy.cs.illinois.edu/talks.html for the full list of talks.
2015
- October 3, 2015. AMS Sectional, Loyola University.
"New methods for estimating species trees from
genome-scale data"
(PPTX)
(PDF)
- October 16, 2015. Program for Evolutionary Dynamics,
Harvard University.
(PPTX)
(PDF)
- November 9, 2015. UCSD Distinguished Lecture,
Department of Computer Science.
(PPT)
(PDF)
- December 4, 2015. MIT
(PPT)
(PDF)
2016
- January 13, UC Berkeley (Integrative Biology)
(PDF)
- January 15, UC Davis (Genomics Institute)
(PDF)
-
February 3, University of Washington (Combi seminar)
(PDF)
-
February 5, Information Theory and Applications,
La Jolla, CA (Plenary talk)
(PDF)
(PPT)
-
February 29 to March 1,
CEHG
Symposium at Stanford University
(PDF)
-
March 8, PhyloPizza, Smithsonian Institute
(JPG)
(PDF)
-
April 18-23, 2016.
Co-evolution in proteins and
RNA, theory and experiments, Cargese, Corsica
(PPTX)
(PDF)
-
May 9-13,
Molecules as documents of
evolutionary history: 50 years after
Roscoff (Brittany), France
(PDF)
-
June 6-10, SIAM Conference Discrete Math,
Georgia State University, Atlanta, Georgia, USA.
(PDF)
-
June 13-15, 2016. Blue Waters Symposium, Oregon.
(PDF)
(PPTX)
-
June 21, 2016.
Austin, Texas.
Evolution 2016 meeting.
New methods for coalescent-based species tree
estimation: ASTRAL, ASTRID, and statistical binning.
(PDF)
-
August 2, 2016.
New methods for coalescent-based species tree
estimation: ASTRAL, ASTRID, and statistical binning
and
How to construct large trees,
University of Trento, Italy.
-
August 25, 2016.
New methods for coalescent-based species tree
estimation: ASTRAL, ASTRID, and statistical binning,
University of Oxford, Statistics.
-
September 16, 2016.
CMU Computational Biology.
(PDF)
-
October 3, 2016.
U Penn Math Department.
(PDF)
-
October 5, 2016.
Princeton CS Department.
(PDF)
(PPT)
-
October 17, 2016.
Georgia Tech, CSE Department Distinguished Lecture.
Genome-scale estimation of the Tree of Life.
(PDF)
(PPT)
-
October 18, 2016.
Georgia Tech, Math Department IMPACT Distinguished Lecture.
Genome-scale estimation of the Tree of Life.
(PDF)
(PPT)
-
October 18, 2016.
Georgia Tech, Math Department seminar.
Constrained exact optimization in phylogenetics.
(PDF)
(PPTX)
-
October 31, 2016.
Statistics Department colloquium, University of Chicago.
(PDF)
(PPT)
-
November 11, 2016.
NYU CS department.
(PDF)
(PPT)
2017
2018
- January 5, 2018.
Hunter College, (PPTX)
- April 4, 2018. Pathobiology, UIUC
(PPTX)
- April 19, 2018. Imperial College, London.
Colloquium, Department of Mathematics.
(PDF)
- April 23, 2018, Keynote talk, RECOMB (Paris, France)
(PDF)
- May 8, 2018. CRA-W Distinguished Lecture, UCLA Computer Science.
(PDF)
- June 2-3, 2018,
Standalone SSB meeting, Ohio State University: ASTRID
- June 4, 2018, Workshop on Species tree estimation at the
Standalone SSB meeting,
Ohio State University.
I gave a tutorial on ASTRAL.
- June 7, 2018.
Foundations of Data Science at the
SIAM Conference on Discrete Mathematics,
Denver, Colorado.
(PDF)
(PPT)
- August 13, 2018. Oxford University, Department of
Statistics,
Theoretical and Empirical Advances in Species Tree Estimation
(PDF)
(PPT)
- September 13 and 14, 2018.
Advances in Phylogenomics, Genome 10K conference
(PDF)
(PPT)
- October 29, 2018. Northwestern University.
(PDF)
(PPT)
- November 7, 2018, IPAM (at UCLA)
(PDF)
(PPT)
- November 16, 2018. Kew Royal Botanical Gardens, London.
(PDF)
(PPT)
2019
- April 30, 2019. European Bioinformatics Institute, Cambridge UK, Invited talk.
- May 10, 2019. University of Indiana at Bloomington, Distinguished Lecture, Department of Computer Science
- May 13-15, NIMBIOS, Knoxville TN, Invited talk.
-
May 28-30, 2019, Invited talk,
The 6th International Conference on Algorithms for Computational Biology
(AlCoB 2019), Berkeley, CA.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.