Software

My research group develops open-source software for multiple sequence alignment, phylogenomics, and metagenomics, with an emphasis on large and challenging datasets. We also have active research in methods for phylogeny estimation for natural languages, and in community detection. All software is released in open source form upon publication, usually on GitHub (clickable links provided below with the name of the software). Within each section, the software is listed in reverse chronological order (most recent software first).

Clustering

2022: AOC (Assembling Overlapping Clusters). This software takes as input a network and a set of clusters for the network, and then adds nodes to the clusters if they meet a user-specified criterion. Developer: Baqiao Liu. (paper) (github)
2021: IKC: Iterative K-Core clustering. This clustering method operates using an iterative technique that finds the smallest k-cores in a graph, removes them, and iterates. Developer: Eleanor Wedell. (paper) (github)

Multiple sequence alignment software

2022: UPP2. This is a modification of UPP, with the goal of speeding up the computation of the alignment by avoiding all-against-all comparisons of sequences to the HMMs in the ensemble of HMMs. UPP2 also improves the weight calculation for each HMM, using a technique from WITCH. Developers: Minhyuk Park and Gillian Chu. (paper) (github).
2022: WITCH-NG, a fast version of the WITCH software for aligning datasets that exhibit sequence length heterogeneity. Developer: Baqiao Liu. (paper) (github).
2021: SALMA: Scalable Alignment using MAFFT-Add, a method for aligning datasets that exhibit sequence length heterogeneity. Developer: Chengze Shen. (paper) (github)
2021: WITCH: Weighted consensus HMM alignment, a method for aligning large datasets that exhibit sequence length heterogeneity. Developer: Chengze Shen. (paper) (github)
2020: MAGUS: Multiple sequence alignment using graph clustering. This method co-estimates alignments and trees, and is an improvement on PASTA and SATé. Like its predecessors, it computes an alignment on a dataset by dividing the dataset into disjoint subsets, aligning each subset independently, and then merging the alignments. Unlike its predecessors, rather than merging the alignments through pairs, it uses a graph clustering method (Markov Clustering) to merge the alignments all at once. This change leads to improved accuracy. Developer: Vladimir Smirnov. (first paper) (second paper, using recursion) (github)
2015: UPP: Ultra-large alignments using Phylogeny-aware Profiles. This method was designed for multiple sequence alignment of datasets that have sequence length heterogeneity. The key innovation is the use of an ensemble of Hidden Markov Models (HMMs) to represent the family based on an alignment only of the full-length sequences. UPP has a strong relationship to the SEPP, TIPP, and HIPPI collection of methods. Developers: N. Nguyen and S. Mirarab. (paper) (github)
2014: PASTA: Practical Alignment using SATé and TrAnsitivity. PASTA directly improves on SATé in accuracy, scalability, and running time. Like SATé, it uses divide-and-conquer to co-estimate alignments and trees, but it changes the way it combines the disjoint subset alignments and so improves performance. MAGUS (see above) further improves PASTA. (paper) (suppl. materials) (github)
2011: FastSP: linear-time calculation of multiple sequence alignment error (SPFN, SPFP) and accuracy (TC) scores. Developer: Siavash Mirarab. (paper) (github)
2009: SATé is a method for co-estimating multiple sequence alignments and trees. Initial developer: Kevin Liu; subsequently maintained by Mark Holder. Publications: (1) Liu, K., S. Raghavan, S. Nelesen, C. R. Linder, T. Warnow, 2009. "Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees." Science, vol. 324, no. 5934, pp. 1561-1564, 19 June 2009, doi: 10.1126/science.1171243.
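As a rough illustration of the error scores that FastSP reports, the sketch below computes SPFN and SPFP by comparing the sets of residue pairs that each alignment places in the same column. This is a naive set-based calculation, not FastSP's linear-time algorithm, and the function names and the dict-of-gapped-strings input format are assumptions made for this example.

```python
from itertools import combinations

def homologous_pairs(alignment):
    """Return the set of residue pairs that share a column.

    alignment: dict mapping sequence name -> gapped row ('-' marks a gap);
    all rows are assumed to have the same length.
    """
    counters = {name: 0 for name in alignment}  # next residue index per sequence
    pairs = set()
    length = len(next(iter(alignment.values())))
    for col in range(length):
        residues = []
        for name, row in alignment.items():
            if row[col] != "-":
                residues.append((name, counters[name]))
                counters[name] += 1
        # Sort so that each pair has the same orientation in both alignments.
        pairs.update(combinations(sorted(residues), 2))
    return pairs

def sp_errors(reference, estimated):
    """SPFN: fraction of reference pairs missing from the estimate.
    SPFP: fraction of estimated pairs absent from the reference.
    Assumes both alignments contain at least one aligned pair."""
    ref = homologous_pairs(reference)
    est = homologous_pairs(estimated)
    spfn = len(ref - est) / len(ref)
    spfp = len(est - ref) / len(est)
    return spfn, spfp

# Toy example: the last residue of s1 is aligned differently.
reference = {"s1": "AC-G", "s2": "ACTG"}
estimated = {"s1": "ACG-", "s2": "ACTG"}
spfn, spfp = sp_errors(reference, estimated)
```

In the toy example each alignment contains three homologous pairs, and the two alignments disagree on one of them.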

Disjoint Tree Mergers

2020: Guide Tree Merger, a method for combining a set of disjoint trees given a tree on the combined leafset. Developer: Vladimir Smirnov. (paper) (github)
2019: TreeMerge, a method for merging a set of disjoint trees given a distance matrix on the combined leafset. This is an improvement on an earlier method called NJMerge. Developer: Erin K. Molloy. (paper) (github)
2019: Constrained-INC, a method for merging a set of disjoint trees given a distance matrix on the combined leafset. The original paper describing the method is by Zhang, Rao, and Warnow (paper). Developer: Thien Le. (paper) (github)

Species tree estimation from multi-locus datasets

2022: Quintet Rooting. This method, also called QR, takes as input an unrooted species tree and a set of unrooted gene trees, and then outputs a rooted version of the species tree, using theoretical results for the multi-species coalescent model (MSC). An improved version of QR, called QR-STAR, is statistically consistent under the MSC, and described in the second paper. Developer: Yasamin Tabatabaee. (first paper) (second paper) (github)
2022: Weighted ASTRID. This is a variant of the ASTRID software for species tree estimation from gene trees that uses weights on edges to compute the internode distance matrix. Developer: Baqiao Liu. (paper) (github)
2021: FASTRAL-constrained and NJst-constrained are methods for running FASTRAL and NJst with user-provided constraints on the output species tree. These methods are thus relevant for species tree estimation from unrooted gene trees, under the multi-species coalescent model. Developer: Baqiao Liu. (paper) (github)
2021: DISCO, software for use in species tree estimation from multi-locus gene family trees, thus addressing gene duplication and loss (GDL). The key technique in DISCO is that it decomposes each gene family tree (i.e., multi-copy gene tree) into a collection of single-copy gene trees, which can then be used in standard species tree estimation pipelines. Developer: James Willson. Publication: Willson J, Roddur MR, Liu B, Zaharias P, and Warnow T. "DISCO: Species Tree Inference using Multicopy Gene Family Tree Decomposition", Systematic Biology 2021 (paper) (github)
2021: FASTRAL, equivalently "Fast ASTRAL". FASTRAL is a method for species tree estimation from a set of unrooted gene trees that is statistically consistent under the multi-species coalescent (MSC) model. FASTRAL improves the running time and memory usage of ASTRAL without reducing accuracy (and sometimes improves accuracy). The key idea is to replace ASTRAL's technique for constraining the search space with one that produces a smaller search space, which lets it run more efficiently. FASTRAL uses ASTRID to compute the search space, and hence maintains statistical consistency. Developer: Payam Dibaeinia. (paper) (github)
2020: FastMulRFS is software for constructing species trees from gene family trees (which can have multiple copies of each species), thus addressing gene duplication and loss. To run FastMulRFS, you replace each gene family tree by its collapsed and relabelled version, and then apply FastRFS to the collection. To compute the collapsed and relabelled version of a gene family tree with multiple copies of the species, first collapse all internal edges that do not produce bipartitions on the leaf set (i.e., collapse an internal edge if at least one species is on both sides of the edge). After collapsing the edges, if there are multiple leaves for any given species, they will all be siblings: replace such a set by just one leaf for that species. Publication: E.K. Molloy and T. Warnow. FastMulRFS: Statistically consistent polynomial time species tree estimation under gene duplication, Bioinformatics 2020 (paper)
2018: SVDquest, software to compute a species tree from a set of multiple sequence alignments, under the multi-species coalescent model. The technique uses SVDquartets (which computes quartet trees under the multi-species coalescent model), and then SVDquest solves the quartet amalgamation optimization problem exactly within a constrained search space that it computes (using maximum likelihood for each gene sequence alignment). Developer: Pranjal Vachaspati. Publication: P. Vachaspati and T. Warnow (2018). SVDquest: Improving SVDquartets species tree estimation using exact optimization within a constrained search space. Molecular Phylogenetics and Evolution, Vol. 124, pp. 122-136, DOI: 10.1016/j.ympev.2018.03.006. (paper) (github)
2017: SIESTA, software to compute consensus trees when ASTRAL or SVDquest finds multiple optimal solutions to their optimization problems. Thus, SIESTA is an add-on to these methods for species tree estimation under the multi-species coalescent model. Developer: Pranjal Vachaspati. Publication: P. Vachaspati and T. Warnow (2018). SIESTA: Enhancing searches for optimal supertrees and species trees. BMC Genomics, 19(Suppl 5):252, DOI: 10.1186/s12864-018-4621-1. Special issue for selected papers from RECOMB-CG, 2017. (paper) (github)
2015: ASTRID. Like ASTRAL, ASTRID is software for estimating a species tree from a set of gene trees under the multi-species coalescent (MSC) model. ASTRID, however, computes a distance matrix from the input trees and then computes a tree for the matrix using FastME (when possible) and otherwise BioNJ*. Publication: P. Vachaspati and T. Warnow. ASTRID: Accurate Species TRees from Internode Distances. RECOMB-Comparative Genomics and BMC Genomics 2015 (paper) (github)
2014: ASTRAL is software for estimating a species tree from a set of gene trees that is statistically consistent under the MSC (multi-species coalescent model). It was originally introduced in 2014, but has undergone many improvements since then. The publications listed here are the ones that PI Warnow was on; see Siavash Mirarab's website for the improvements and extensions. Developer: Siavash Mirarab. Publications: (1) S. Mirarab, R. Reaz, Md S. Bayzid, T. Zimmermann, M.S. Swenson, and T. Warnow, "ASTRAL: Genome-Scale Coalescent-Based Species Tree Estimation." Proceedings, ECCB (European Conference on Computational Biology), 2014. Also, Bioinformatics 2014 30 (17): i541-i548. doi: 10.1093/bioinformatics/btu462. (PDF) and (2) S. Mirarab and T. Warnow, "ASTRAL-II: coalescent-based species tree estimation with many hundreds of taxa and thousands of genes", Proceedings ISMB 2015, and Bioinformatics 2015 31 (12): i44-i52 doi: 10.1093/bioinformatics/btv234 (PDF) (github)
2013: DynaDup is software to compute a species tree given a set of rooted gene trees, under gene duplication and loss scenarios. DynaDup is similar to DupTree in its objective criterion. DynaDup uses dynamic programming to solve the optimization criterion (minimize duplications and losses) within a constrained search space, just like ASTRAL. Developer: Md. S. Bayzid. Publications: (first paper), (second paper). (github)
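The collapse-and-relabel preprocessing described in the FastMulRFS entry above can be sketched as follows. This is an illustrative sketch only, not the FastMulRFS implementation: the adjacency-map tree representation, the function names, and the convention that internal nodes are integers while leaf ids and species names are strings are all assumptions made for this example.

```python
from collections import defaultdict

def tree_from_edges(edges):
    """Build an adjacency map (node -> set of neighbours) from an edge list."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)

def species_on_side(adj, species, start, blocked):
    """Species reachable from `start` without crossing the edge to `blocked`."""
    seen, stack, found = {start, blocked}, [start], set()
    while stack:
        node = stack.pop()
        if node in species:
            found.add(species[node])
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return found

def collapse_and_relabel(adj, species):
    """adj: unrooted gene family tree; species: leaf id -> species name."""
    # Union-find over internal nodes; contracted edges merge their endpoints.
    parent = {n: n for n in adj if n not in species}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    # Step 1: collapse each internal edge that has some species on both sides.
    for u in parent:
        for v in adj[u]:
            if v in parent:
                left = species_on_side(adj, species, u, v)
                right = species_on_side(adj, species, v, u)
                if left & right:
                    parent[find(u)] = find(v)

    # Step 2: rebuild the tree, naming each leaf by its species so that
    # sibling copies of one species become a single leaf.
    def name(n):
        return species[n] if n in species else find(n)

    new_adj = defaultdict(set)
    for u in adj:
        for v in adj[u]:
            a, b = name(u), name(v)
            if a != b:
                new_adj[a].add(b)
                new_adj[b].add(a)
    return dict(new_adj)

# Toy gene family tree with two copies (a1, a2) of species A.
edges = [(0, "a1"), (0, "b"), (0, 1), (1, "c"), (1, 2),
         (2, "a2"), (2, "d"), (2, 3), (3, "e"), (3, "f")]
species = {"a1": "A", "a2": "A", "b": "B", "c": "C",
           "d": "D", "e": "E", "f": "F"}
result = collapse_and_relabel(tree_from_edges(edges), species)
```

In the toy tree, the two internal edges that separate a1 from a2 are collapsed, leaving one node adjacent to A, B, C, and D, while the edge leading to the (e, f) cherry is kept. The resulting single-copy trees would then be passed to FastRFS.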

Gene tree completion and correction

2020: TRACTION is software for resolving a non-binary gene tree using a binary reference tree (both unrooted, single copy trees), so as to minimize the resulting Robinson-Foulds distance. TRACTION can be followed with OCTAL to add the missing taxa into the gene tree. Developer: Sarah Christensen. (paper). (github)
2018: OCTAL, software for completing gene trees using a binary reference tree (both unrooted, single copy trees). OCTAL adds the missing species into the gene tree so as to minimize the resulting Robinson-Foulds distance to the reference tree. Developer: Sarah Christensen. (paper) (github)
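Both TRACTION and OCTAL optimize the Robinson-Foulds (RF) distance, which counts the bipartitions of the leaf set found in one tree but not the other. The sketch below computes the unnormalized RF distance between two unrooted trees on the same leaf set. It is only an illustration of the criterion, not code from either tool; the adjacency-map representation and the convention that leaves are strings and internal nodes are integers are assumptions for this example.

```python
from collections import defaultdict

def tree_from_edges(edges):
    """Build an adjacency map (node -> set of neighbours) from an edge list."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)

def leaves_on_side(adj, leaves, start, blocked):
    """Leaves reachable from `start` without crossing the edge to `blocked`."""
    seen, stack, found = {start, blocked}, [start], set()
    while stack:
        node = stack.pop()
        if node in leaves:
            found.add(node)
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return found

def bipartitions(adj):
    """Bipartitions induced by internal edges, each stored as the side
    that does not contain a fixed anchor leaf."""
    leaves = {n for n, nbrs in adj.items() if len(nbrs) == 1}
    anchor = min(leaves)
    splits = set()
    for u in adj:
        for v in adj[u]:
            if u in leaves or v in leaves:
                continue  # pendant edges give trivial bipartitions
            side = leaves_on_side(adj, leaves, u, v)
            if anchor in side:
                side = leaves - side
            splits.add(frozenset(side))
    return splits

def rf_distance(adj1, adj2):
    """Number of bipartitions present in exactly one of the two trees."""
    return len(bipartitions(adj1) ^ bipartitions(adj2))

# Three five-leaf trees: ((a,b),e,(c,d)), ((a,c),e,(b,d)), ((a,b),c,(d,e)).
t1 = tree_from_edges([(0, "a"), (0, "b"), (0, 1), (1, "e"), (1, 2), (2, "c"), (2, "d")])
t2 = tree_from_edges([(0, "a"), (0, "c"), (0, 1), (1, "e"), (1, 2), (2, "b"), (2, "d")])
t3 = tree_from_edges([(0, "a"), (0, "b"), (0, 1), (1, "c"), (1, 2), (2, "d"), (2, "e")])
```

Here t1 and t2 share no nontrivial bipartition, while t1 and t3 share the split that separates a and b from the other leaves.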

Supertree construction

2017: SIESTA, software to compute consensus trees when FastRFS finds multiple optimal solutions to the Robinson-Foulds Supertree optimization problem. Developer: Pranjal Vachaspati. (paper) (github)
2016: FastRFS, supertree estimation based on minimizing the total Robinson-Foulds distance to the input trees. Developer: Pranjal Vachaspati. (paper) (github)
2012: SuperFine, a meta-method for improving supertree methods. The implementation at this github site enables SuperFine (Swenson et al., Systematic Biology 2012) to be used with MRP, but other supertree methods can be combined with SuperFine as well. Developer: Shel Swenson. (paper) (github)

Metagenomics

SEPP, TIPP, and HIPPI, but see also TeraTrees for the latest version of TIPP. Gillian Chu also has a forked version of this repository with some additions at (github). SEPP, TIPP, and HIPPI are three methods with a common code base that address (a) phylogenetic placement of sequences into a tree (SEPP), (b) taxonomic identification of metagenomic reads (TIPP), and (c) gene binning for molecular sequences (HIPPI). UPP (an alignment method) is also related to this collection. Publications: (1) Mirarab, S., N. Nguyen, and T. Warnow. "SEPP: SATe-Enabled Phylogenetic Placement." Proceedings of the 2012 Pacific Symposium on Biocomputing (PSB 2012), 17:247-258, (2) N. Nguyen, S. Mirarab, B. Liu, M. Pop, and T. Warnow "TIPP: Taxonomic Identification and Phylogenetic Profiling." Bioinformatics (2014) 30(24):3548-3555. (HTML), (3) N. Nguyen, M. Nute, S. Mirarab, and T. Warnow (2016). HIPPI: Highly accurate protein family classification with ensembles of HMMs. BMC Genomics 17 (Suppl 10):765, special issue for RECOMB-CG. (HTML) (supplement), and (4) Shah N, Molloy EK, Pop M, and Warnow T, "TIPP2: metagenomic taxonomic profiling using phylogenetic markers," Bioinformatics, 2020. (HTML)

Phylogenetic Placement

2022: SCAMPP, a divide-and-conquer technique to enable phylogenetic placement methods to run on large backbone trees. Developer: Eleanor Wedell. (paper), (github).

Other

Other software, not yet associated with papers, is also available. See the "ogcat" collection by Baqiao Liu as an example: (github).