CS 581: Final Project (suggestions)

CS 581 (Fall 2022) Final Project Suggestions

The textbook (Computational Phylogenetics: An Introduction to Designing Methods for Phylogeny Estimation) has several suggestions for projects, many of which would be good for this course. You may also have your own ideas for a project! The suggestions below are just to get you started with thinking about what you might do. (Note, however, that some projects--especially those that evaluate and explore Bayesian methods, may be too computationally intensive to do as a course project, and would be better as a post-581 project.) Also, please note that the course project is evaluated with respect to writing (including style, low-level writing, and reproducibility) as well as the content. Writing counts 25% of the grade!

Specific projects of interest to me (TW)

Projects related to TIPP (https://doi.org/10.1093/bioinformatics/btu721)
- Evaluate taxon identification methods, comparing methods that place into taxonomies to methods that place into a gene tree and then inherit the taxonomic information from the leaf labels.
Projects related to gene and species tree estimation
- Test GTM (https://doi.org/10.1186/s12864-020-6605-1) in a pipeline with methods other than maximum likelihood (e.g., neighbor joining or FastME).
- Test GTM (https://doi.org/10.1186/s12864-020-6605-1) and TreeMerge (https://doi.org/10.1093/bioinformatics/btz344), two Disjoint Tree Merger methods, under conditions with different model tree topologies (balanced vs. caterpillar). More generally, see if you can find a condition where GTM is less accurate than TreeMerge.
- Develop a new Disjoint Tree Merger pipeline and test it in comparison to GTM
- Develop a new quartet amalgamation method and compare to Quartets Max Cut (DOI: 10.1109/TCBB.2008.133)
- Evaluate impact of species tree shape on methods for estimating trees or rooting trees.
- Evaluate methods for generating random trees in terms of their tree shapes (and see previous item).
- Develop techniques for rooting gene trees with respect to species trees that are not fully resolved (expand on ideas in NOTUNG)
- Evaluate methods for tree construction in the presence of rogue taxa (that are nevertheless homologs)
Projects related to ensembles of HMMs
- Develop a rigorous way of designing ensembles of profile HMMs (think cross-validation)
- Extend design or use of ensemble of HMMs in some interesting way (e.g., use some other generative model, or define a mixture model)
Projects related to multiple sequence alignment (MSA)
- Modify the MAGUS design by using a different clustering method instead of the Markov Clustering algorithm.
- Evaluate UPP and other existing MSA methods on datasets with sequence length heterogeneity where some sequences are very long (and possibly design a new method that addresses this problem better)
- Develop and test methods for merging three or more disjoint alignments of different lengths (compare to GCM in MAGUS)
- Test Prank in comparison to MAFFT as evolutionary diameter and deviation from a clock increase, when given a guide tree
- Evaluate Prank or Probcons within PASTA or MAGUS, and compare to default MAGUS and PASTA
- Compare Regressive (https://doi.org/10.1038/s41587-019-0333-6) to PASTA and MAGUS
Parallel algorithms projects:
- Parallelize UPP (multiple sequence alignment, by Nam-phuong Nguyen et al.)
- Parallelize GTM (Guide Tree Merger, by Vlad Smirnov and me)
- Parallelize Balanced Minimum Evolution within FastME
Historical linguistics
- Run maximum likelihood methods on simulated data for language evolution (check with me)

Projects related to multiple sequence alignment

Modify the sequence evolution simulator Indelible (Fletcher and Yang) so that it allows the root sequence to be specified. The code is open source and written in C++.
Compare methods that can align two or more alignments, and evaluate for accuracy on simulated datasets. Examples of such methods: Opal, Muscle, Prime, and GCM.
Test POY using gap penalties that are not affine gap penalties for accuracy.
Examine ways of computing a consensus alignment on the output from PASTA, and evaluate for accuracy in comparison to the single alignment returned by PASTA. For example, compute the posterior decoding of a random sample of the alignments PASTA computes (after removing the first few alignments). To compute the posterior decoding, you can install BAli-Phy (Redelings and Suchard), using the directions on the website: http://balip-phy.org/README.html#installation. Once the software is installed, there is a folder of executable files called "bin", and within that folder there is a folder called "alignment-max".
Examine ways to annotate the sites in a multiple sequence alignment for reliability, based on examining a set of alternate alignments obtained by running a basic alignment method using different strategies (e.g., different parameter settings).
Find codes for computing a point estimate of an alignment given an arbitrary set of multiple sequence alignments (e.g., the posterior decoding algorithm in BAli-Phy, but also look at T-Coffee), and compare them for accuracy and/or scalability.
Explore BAli-Phy to see if you can improve it. For example, determine if you can use it to find a tree given a fixed alignment. or to find the best alignment on a fixed tree.
In "Benchmarking statistical multiple sequence alignment" (bioRxiv doi: https://doi.org/10.1101/304659), my students and I showed a strange trend in which BAli-Phy did very well on simulated amino acid datasets but not very well on biological amino acid datasets. Try to do this evaluation on nucleotide datasets.
Evaluate the impact of violating the strict molecular clock on either multiple sequence alignment or phylogenetic tree estimation, especially when using statistical methods (such as BAli-Phy, Prank, etc.).

Projects related to supertree estimation or phylogenomic species tree estimation

Develop a good method for weighted quartet tree amalgamation. Compare to Weighted Quartets MaxCut by Avni et al. (2014), DOI: 10.1093/sysbio/syu087.
Evaluate Weighted Quartets MaxCut (Avni et al. 2014) as a supertree method on the SMIDgen datasets (Swenson et al.).
Implement a parallel version of some good supertree method, such as FastRFS (Vachaspati and Warnow), Quartets MaxCut (Snir and Rao), etc.
Quartet-based supertree methods have computational limitations in that they depend on computing all quartet trees. Evaluate variants that only require computing a subset of the quartet trees.
Evaluate the impact of multiple sequence alignment error on SVDquartets (Chifman and Kubatko, as implemented in PAUP*).

Projects related to maximum likelihood gene tree estimation

Evaluate the impact of the starting tree on FastTree-2.
ML methods have been explored in terms of the impact of the tree topology or the ML score, and in general RAxML has been found to be better than other methods for ML score but only rarely better than FastTree-2 in terms of tree topology. Try to characterize the conditions in which RAxML is generally more accurate than FastTree-2 (or other ML methods) for tree topology.
Modify some maximum likelihood method to take support values per site into account.
Develop a new Disjoint Tree Merger (DTM) method, and test it in comparison to GTM.
Evaluate the impact of violating the strict molecular clock on on phylogenetic tree estimation, especially when using statistical methods such as BAli-Phy.
Evaluate the impact of tree shape (e.g., the Aldous beta-splitting model) on tree estimation.

Other

Develop a method that can compute a tree from an incomplete dissimilarity matrix (i.e., matrices where some of the entries do not have values). Compare to the "NJ*" methods from Criscuolo and Gascuel (2008), https://doi.org/10.1186/1471-2105-9-166, on some simulated datasets.
Find an application area (other than biological phylogenetics) where tree estimation (especially of large datasets) is important and not necessarily solved well, and where accuracy can be measured. Try to develop a new approach to tree estimation, and compare to the current best methods.
Explore the "ensemble of profile HMMs" approach with some other generative model or some other biological application
Develop a method that performs taxon identification of sequences that takes the following as input: a taxonomy T and a tree T' in which all but one of the leaves is assigned a taxonomic label using T. Test your method where T' is produced by adding a single short sequence (i.e., read) into an estimated maximum likelihood tree on full length sequences for a single gene. Thus, your method will rely on a phylogenetic placement method.
Develop a divide-and-conquer approach to phylogenetic network estimation, and test it.
Test divide-and-conquer DTM methods for use with genome rearrangement phylogeny estimation methods.
Find some published dataset and re-analyze it using the new methods for phylogeny estimation (either gene trees or species trees).