Software
My research group develops
open-source software for multiple sequence alignment, phylogenomics,
and metagenomics, with an emphasis on large and challenging datasets.
We also have active research in methods for phylogeny estimation for natural
languages, and in community detection.
All software is released in open source form upon publication, usually on GitHub
(clickable links provided below with the name of the software).
Within each section, the software is listed in reverse chronological order (most
recent software first).
Clustering
- 2022:
AOC
(Assembling Overlapping Clusters).
This software takes as input a network and a set of clusters for the network,
and then adds nodes to the clusters if they meet a user-specified
criterion.
Developer: Baqiao Liu.
(paper)
(github)
- 2021:
IKC: Iterative K-Core clustering.
This clustering method operates using an iterative technique that finds the smallest k-cores in
a graph, removes them, and iterates.
Developer: Eleanor Wedell.
(paper)
(github)
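As a rough illustration of the iterative k-core idea behind IKC (not the released implementation), the Python sketch below repeatedly extracts one k-core, removes it, and continues on what remains. It assumes that "smallest k-core" means the innermost core, that is, the non-empty k-core with the largest k, since k-cores are nested. The function names and the toy graph are hypothetical, and any further checks the actual software applies to candidate clusters are not modeled here.

```python
def build_graph(edges):
    """Build an undirected adjacency map {node: set(neighbors)} from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def core_numbers(adj):
    """Core number of every node, found by repeatedly peeling a minimum-degree node."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        v = min(remaining, key=degree.__getitem__)
        k = max(k, degree[v])          # core numbers never decrease while peeling
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core


def iterative_k_core(adj):
    """Extract the innermost k-core, remove it from the graph, and repeat."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}   # work on a copy
    clusters = []
    while adj:
        core = core_numbers(adj)
        k = max(core.values())
        if k == 0:                     # only isolated nodes are left
            break
        members = {v for v, c in core.items() if c == k}
        clusters.append((k, sorted(members)))
        for v in members:
            del adj[v]
        for nbrs in adj.values():
            nbrs.difference_update(members)
    return clusters


# Toy graph (hypothetical): a 4-clique, a triangle hanging off it, and a pendant node.
toy_edges = [
    ("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d"),
    ("a", "e"), ("e", "f"), ("e", "g"), ("f", "g"), ("g", "h"),
]
print(iterative_k_core(build_graph(toy_edges)))
```

The peeling routine here is quadratic in the number of nodes, which keeps the sketch short; a bucket-based implementation would be the natural choice for large networks.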
Multiple sequence alignment software
- 2023:
EMMA.
This code is designed for adding sequences into a multiple sequence alignment.
It provides high accuracy, better than other methods, for the case
where the added sequences are generally close
to full-length.
Developer: Chengze Shen.
(paper).
(github).
- 2022:
UPP2.
This is a modification of UPP, with the
goal of speeding up the computation of the alignment by avoiding
all-against-all comparisons of sequences to the HMMs in the ensemble of HMMs.
UPP2 also improves the weight calculation for each HMM, using a technique from WITCH.
Developers: Minhyuk Park and Gillian Chu.
(paper)
(github).
-
2022:
WITCH-NG,
a fast version of the WITCH software for
aligning datasets that exhibit sequence length heterogeneity.
Developer: Baqiao Liu.
(paper)
(github).
-
2021:
WITCH:
Weighted consensus HMM alignment,
a method for aligning large datasets that exhibit
sequence length heterogeneity.
Developer: Chengze Shen.
(paper)
(github)
-
2020:
MAGUS: Multiple sequence alignment
using graph clustering.
This method co-estimates alignments and trees, and is an improvement on PASTA and SATé.
Like its predecessors, it computes an alignment on a dataset by dividing the dataset into
disjoint subsets, aligning each subset independently, and then merging the alignments.
Unlike its predecessors, rather than merging the alignments through pairs, it uses
a graph clustering method (Markov Clustering) to merge the alignments all at once.
This change leads to improved accuracy.
Developer: Vladimir Smirnov.
(first paper)
(second paper, using recursion)
(github)
-
2015:
UPP:
Ultra-large alignments using Phylogeny-aware Profiles.
This method was designed for multiple sequence alignment of datasets
that have sequence length heterogeneity. The key innovation is the use
of an
ensemble of Hidden Markov Models (HMMs) to represent the family based
on an alignment only of the full-length sequences.
UPP has a strong relationship to the SEPP, TIPP, and HIPPI collection of
methods.
Developers: N. Nguyen and S. Mirarab.
(paper)
(github)
- 2014:
PASTA: Practical Alignment
using SATé and TrAnsitivity.
PASTA is a direct improvement over SATé, with improvements in
accuracy, scalability, and running time. Like SATé, it uses divide-and-conquer to co-estimate
alignments and trees, but it changes the way it combines
the disjoint subset alignments and so improves performance.
MAGUS (see above) further improves PASTA.
(paper)
(suppl. materials).
(github)
-
2011:
FastSP:
linear-time calculation
of multiple sequence alignment error (SPFN, SPFP) and accuracy (TC) scores.
Developer: Siavash Mirarab.
(paper)
(github)
-
2009:
SATé is a method for
co-estimating multiple sequence alignments and trees.
Initial developer Kevin Liu, subsequently maintained by Mark Holder.
Publications:
(1) Liu, K., S. Raghavan, S. Nelesen, C. R. Linder, T. Warnow, 2009.
"Rapid and accurate large-scale
coestimation of sequence alignments and phylogenetic trees."
Science,
vol. 324, no. 5934,
pp. 1561-1564, 19 June 2009, doi: 10.1126/science.1171243.
Disjoint Tree Mergers
Species tree estimation from multi-locus datasets
- 2022:
Quintet Rooting.
This method, also called QR,
takes as input an unrooted species tree and a set of unrooted gene trees,
and then outputs a rooted version of the species tree, using theoretical
results for the multi-species coalescent model (MSC).
An improved version of QR, called QR-STAR, is statistically consistent under the MSC,
and described in the second paper.
Developer:
Yasamin Tabatabaee.
(first paper)
(second paper)
(github)
- 2022:
Weighted ASTRID.
This is a variant of the ASTRID software for species tree estimation
from gene trees that uses weights on edges to compute the
internode distance matrix.
Developer: Baqiao Liu.
(paper)
(github)
- 2021:
FASTRAL-constrained
and
NJst-constrained
are methods for running FASTRAL and NJst
with user-provided constraints on the output species tree.
These methods are
thus relevant for species tree estimation from unrooted gene trees, under the multi-species coalescent model.
Developer: Baqiao Liu.
(paper)
(github)
- 2021:
DISCO,
software for use in species tree estimation
from multi-locus gene family trees, thus addressing gene duplication and loss (GDL).
The key technique in DISCO is that it decomposes
each gene family tree (i.e., multi-copy gene tree)
into a collection of single-copy gene trees,
which can then be used in standard species tree estimation
pipelines.
Developer: James Willson.
Publication:
Willson J, Roddur MR, Liu B, Zaharias P, and Warnow T.
"DISCO: Species Tree Inference using Multicopy Gene Family Tree Decomposition",
Systematic Biology 2021
(paper)
(github)
- 2021:
FASTRAL, equivalently
"Fast ASTRAL".
FASTRAL is a method for species tree estimation from a set of unrooted gene trees
that is statistically consistent under the multi-species coalescent (MSC) model.
FASTRAL improves the running time and memory usage of ASTRAL without
reducing accuracy (and sometimes improves accuracy).
The key idea is to replace ASTRAL's technique for constraining the search space
with one that yields a smaller search space, which lets it run more efficiently.
FASTRAL uses ASTRID to compute the search space, and hence
maintains statistical consistency.
Developer: Payam Dibaeinia.
(paper)
(github)
- 2020:
FastMulRFS
is software for constructing species trees from
gene family trees (which can have multiple
copies of each species), thus addressing gene duplication and loss.
To run FastMulRFS, you replace each gene family tree by its
collapsed and relabelled version, and then
apply FastRFS
to the collection.
To compute the collapsed and relabelled version of a gene
family tree with multiple copies of the species, first
collapse all internal edges that do not
produce bipartitions on the leaf set (i.e., collapse an internal edge if
at least
one
species is on both sides of the edge).
After collapsing the edges, if there are multiple
leaves for any given
species, they will all be siblings: replace such a set by just one
leaf for that species.
Publication:
E.K. Molloy and T. Warnow.
FastMulRFS: Statistically consistent polynomial time species tree estimation under gene duplication,
Bioinformatics 2020
(paper)
- 2018:
SVDquest,
software to compute a species tree from a set of multiple sequence alignments, under the multi-species coalescent model.
The technique uses
SVDquartets (which computes quartet trees under the multi-species coalescent model), and then
SVDquest solves the quartet amalgamation optimization
problem exactly within a constrained search space that it computes (using maximum
likelihood for each gene sequence alignment).
Developer: Pranjal Vachaspati.
Publication: P. Vachaspati and T. Warnow (2018).
SVDquest: Improving SVDquartets species tree estimation using exact optimization within a constrained search space.
Molecular Phylogenetics and Evolution, Vol. 124, pp. 122-136,
DOI: 10.1016/j.ympev.2018.03.006.
(paper)
(github)
- 2017:
SIESTA,
software to compute consensus trees when ASTRAL or SVDquest finds
multiple optimal solutions to their optimization problems.
Thus, SIESTA is an add-on for species tree estimation, under the multi-species coalescent model, to these
methods.
Developer: Pranjal Vachaspati.
Publication: P. Vachaspati and T. Warnow (2018).
SIESTA: Enhancing searches for optimal supertrees and species trees.
BMC Genomics, 19(Suppl 5):252,
DOI: 10.1186/s12864-018-4621-1
Special issue for selected papers from RECOMB-CG, 2017.
(paper) (github)
- 2015:
ASTRID.
Like ASTRAL, ASTRID is software
for estimating a species tree from a set of gene
trees under the multi-species coalescent (MSC) model. ASTRID, however,
computes a distance matrix from the input trees and then
computes a tree for the matrix using FastME (when possible)
and otherwise BioNJ*.
Publication: P. Vachaspati and T. Warnow.
ASTRID: Accurate Species TRees from Internode
Distances.
RECOMB-Comparative Genomics and BMC Genomics 2015
(paper),
(github).
-
2014:
ASTRAL
is software for estimating
a species tree from a set of gene trees that is statistically
consistent under the MSC (multi-species coalescent model).
It was originally introduced in 2014, but has undergone many improvements
since then.
The publications listed here are the ones that PI Warnow was on; see
Siavash Mirarab's website for the improvements and extensions.
Developer: Siavash Mirarab.
Publications: (1) S. Mirarab, R. Reaz, Md S. Bayzid, T. Zimmermann, M.S. Swenson,
and T. Warnow,
"ASTRAL: Genome-Scale Coalescent-Based Species Tree Estimation."
Proceedings, ECCB (European Conference on
Computational Biology), 2014.
Also, Bioinformatics 2014 30 (17): i541-i548.
doi: 10.1093/bioinformatics/btu462.
(PDF)
and
(2)
S. Mirarab and T. Warnow,
"ASTRAL-II: coalescent-based species tree estimation
with many hundreds of taxa and thousands of genes",
Proceedings ISMB 2015, and
Bioinformatics
2015 31 (12): i44-i52
doi: 10.1093/bioinformatics/btv234
(PDF)
(github)
- 2013:
DynaDup
is software to compute a species tree given a set of rooted gene
trees, under gene duplication and loss scenarios.
DynaDup is similar to DupTree in its objective
criterion.
DynaDup uses dynamic programming to optimize this
criterion (minimizing duplications and losses)
within a constrained search space, just like
ASTRAL.
Developer: Md. S. Bayzid.
Publications:
(first paper),
(second paper).
(github)
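The collapse-and-relabel preprocessing described in the FastMulRFS entry above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the FastMulRFS code: trees are written as nested tuples whose leaves are species names (a hypothetical representation chosen for brevity), the function names are made up, and the final step of passing the processed trees to FastRFS is not shown. Suppressing any internal node left with a single child after duplicate leaves are merged is a tidy-up added here, not something stated in the description.

```python
from collections import Counter


def leaf_counts(tree):
    """Count how many leaves carry each species label in a nested-tuple tree."""
    if isinstance(tree, str):
        return Counter([tree])
    total = Counter()
    for child in tree:
        total += leaf_counts(child)
    return total


def collapse_and_relabel(tree):
    """Collapse internal edges with a species on both sides, then merge duplicate leaves."""
    totals = leaf_counts(tree)

    def splits_a_species(subtree):
        # True if some species has copies both inside and outside this subtree.
        below = leaf_counts(subtree)
        return any(below[s] < totals[s] for s in below)

    def process(node):
        if isinstance(node, str):
            return node
        children = []
        for child in node:
            done = process(child)
            if isinstance(done, tuple) and splits_a_species(child):
                children.extend(done)      # collapse the edge above this child
            else:
                children.append(done)
        # Copies of a species are now sibling leaves: keep one leaf per species.
        seen, unique = set(), []
        for c in children:
            if isinstance(c, str):
                if c in seen:
                    continue
                seen.add(c)
            unique.append(c)
        # Assumption of this sketch: drop a node that is left with one child.
        return unique[0] if len(unique) == 1 else tuple(unique)

    return process(tree)


# Hypothetical gene family tree with two copies of species A.
gene_family_tree = (("A", "B"), ("A", "C"), ("D", "E"))
print(collapse_and_relabel(gene_family_tree))
```

In the toy tree, both edges that separate the two copies of A are collapsed, while the edge above (D, E) survives because it induces a bipartition of the species set.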
Gene tree completion and correction
- 2020:
TRACTION is software for resolving a non-binary
gene tree using a binary reference tree (both unrooted, single
copy trees), so as to minimize the resulting
Robinson-Foulds distance.
TRACTION can be followed with OCTAL to add the missing taxa into the
gene tree.
Developer: Sarah Christensen.
(paper).
(github)
- 2018:
OCTAL,
software for completing gene trees using a binary reference tree (both unrooted,
single copy trees).
OCTAL adds the missing species into the gene tree so as to
minimize the resulting Robinson-Foulds distance to the
reference tree.
Developer: Sarah Christensen.
(paper)
(github)
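Both TRACTION and OCTAL are described above as minimizing the Robinson-Foulds (RF) distance to a reference tree. For readers unfamiliar with that objective, the Python sketch below computes the RF distance between two unrooted trees on the same leaf set as the number of non-trivial bipartitions found in exactly one of the two trees. It only evaluates the criterion and does not implement either method; the nested-tuple tree representation and the function names are assumptions made for illustration.

```python
def bipartitions(tree):
    """Non-trivial bipartitions of an unrooted tree given as nested tuples of leaf names."""
    leaves = set()
    clades = []

    def visit(node):
        if isinstance(node, str):
            leaves.add(node)
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        clades.append(below)
        return below

    visit(tree)
    anchor = min(leaves)                 # fixed leaf used to pick a canonical side
    all_leaves = frozenset(leaves)
    result = set()
    for clade in clades:
        side = clade if anchor not in clade else all_leaves - clade
        if 2 <= len(side) <= len(leaves) - 2:   # skip trivial bipartitions
            result.add(side)
    return result


def rf_distance(tree1, tree2):
    """Number of bipartitions present in exactly one of the two trees."""
    return len(bipartitions(tree1) ^ bipartitions(tree2))


# Hypothetical example: a binary reference tree and a partially unresolved gene tree.
reference_tree = (("A", "B"), ("C", "D"), "E")
unresolved_gene_tree = (("A", "B"), "C", "D", "E")
print(rf_distance(reference_tree, unresolved_gene_tree))
```

Each bipartition is stored as the side that does not contain a fixed anchor leaf, so the same split is represented identically no matter where the nested-tuple representation happens to be rooted.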
Supertree construction
- 2016:
FastRFS,
supertree estimation based on minimizing the
total Robinson-Foulds distance to the input trees.
Developer: Pranjal Vachaspati.
(paper)
(github)
- 2012:
SuperFine,
a meta-method for improving supertree methods. The
implementation at this github site enables SuperFine
(Swenson et al., Systematic Biology 2012) to be used
with MRP, but other supertree methods can be combined
with SuperFine as well.
Developer: Shel Swenson.
(paper)
(github)
-
2017:
SIESTA,
software to compute consensus trees when FastRFS finds
multiple optimal solutions to the Robinson-Foulds Supertree optimization problem.
Developer: Pranjal Vachaspati.
(paper)
(github)
Metagenomics
-
SEPP, TIPP, and HIPPI,
but see also TeraTrees for
the latest version of TIPP. Gillian Chu also has a forked version
of this repository with some additions at (github).
SEPP, TIPP, and HIPPI are
three methods
with a common code base that address (a) phylogenetic
placement of sequences into a tree (SEPP), (b) taxonomic identification of metagenomic
reads (TIPP), and (c) gene binning for molecular sequences (HIPPI).
UPP (an alignment method) is also related to this collection.
Publications: (1) Mirarab, S., N. Nguyen, and T. Warnow.
"SEPP: SATe-Enabled Phylogenetic
Placement."
Proceedings of the 2012 Pacific Symposium on Biocomputing
(PSB 2012), 17:247-258,
(2)
N. Nguyen, S. Mirarab, B. Liu, M. Pop, and T. Warnow
"TIPP: Taxonomic Identification and Phylogenetic Profiling."
Bioinformatics (2014)
30(24):3548-3555.
(HTML), and
(3)
N. Nguyen, M. Nute, S. Mirarab, and T. Warnow (2016).
HIPPI: Highly accurate protein family classification with ensembles of HMMs.
BMC Genomics 17 (Suppl 10):765, special issue for RECOMB-CG.
(HTML)
(supplement),
and
(4)
Shah N, Molloy EK, Pop M, and Warnow T, "TIPP2: metagenomic taxonomic profiling using phylogenetic markers," Bioinformatics, 2020.
(HTML)
Phylogenetic Placement
- 2025: BATCH-SCAMPP.
This is an improvement on SCAMPP, achieving nearly the same
accuracy but with much reduced runtime.
Developers: Eleanor Wedell and Chengze Shen.
(paper).
(github).
- 2022:
SCAMPP, a divide-and-conquer
technique to enable phylogenetic placement
methods to run on large backbone trees.
Developer: Eleanor Wedell.
(paper),
(github).
Other
-
Other software, not yet associated with papers, is
available.
See the "ogcat" collection by Baqiao Liu as an example:
(github).