CS 581 Final Project Suggestions
The textbook (Computational Phylogenetics: An Introduction
to Designing Methods for Phylogeny Estimation) has several
projects, many of which would be good for this course.
You may also have your own ideas for a project!
The suggestions below are just to get you started with thinking
about what you might do.
Projects related to multiple sequence alignment
- Modify the sequence evolution simulator INDELible
(Fletcher and Yang) so that it allows
the root sequence to be specified.
The code is open source and written in C++.
- Compare methods that can align two alignments, and evaluate
for accuracy on simulated datasets.
Examples of such methods: Opal, Muscle, and Prime.
- Many alignment methods have been evaluated in
terms of standard alignment criteria (SPFN, SPFP, column
score) and some have been evaluated in terms of
the impact on tree topology accuracy.
However, very few have been evaluated in terms of impact on
estimating numeric parameters
such as the GTR substitution matrix or branch lengths.
Perform such an evaluation.
- Design a method that can find outliers in
a set of sequences (i.e., can find the sequences that are not
homologous to the remaining sequences), under the
assumption that nearly all the input sequences are homologous to each other.
- Test POY using gap penalties that are not affine.
- Examine ways of computing a consensus alignment on the output from
PASTA, and evaluate for accuracy in comparison to the single alignment
returned by PASTA.
For example, compute the posterior
decoding of a random sample of the alignments PASTA computes (after
removing the first few alignments).
To compute the posterior decoding, you can install
BAli-Phy (Redelings and Suchard), using
the directions on the website:
Once the software is installed, there is a folder of
executable files called "bin", and within that
there is a folder called "alignment-max".
- Examine ways to annotate the sites in a multiple sequence alignment
for reliability, based on examining
a set of alternate alignments obtained
by running a basic alignment method using different strategies
(e.g., different parameter settings).
- Find codes for computing a point estimate of
an alignment given
an arbitrary set of multiple sequence alignments
(e.g., the posterior decoding algorithm in BAli-Phy, but also
look at T-Coffee), and compare them for accuracy and/or scalability.
- Explore BAli-Phy to see if you can improve it.
For example, determine if you can use it to score a
given alignment/tree pair, to find a tree given a fixed alignment,
or to find the best alignment on a fixed tree.
Or see if you can improve it by giving it a good starting
Or see if it is missing any important substitution models, and if
so modify it to enable them.
- It is well known that over-alignment (where the computed alignment
is shorter than the true alignment) can lead to
false discovery of
positive selection (i.e., the conclusion that
there is positive selection even when there is
in fact no such positive selection).
What are the other impacts of over-alignment? For example,
does over-alignment result in the expansion of branch lengths?
It is known that many alignment methods (e.g., MAFFT and
Clustal-Omega) tend to over-align, so understanding the impact of
over-alignment is important.
Less is known about the impact of under-alignment, however, and
yet some methods (notably PASTA, Prank, and PAGAN) tend to under-align.
Evaluate methods that under-align to determine the impact
- PASTA is a method that
iterates between tree estimation and alignment estimation, and has
very good accuracy in tree topology estimation on very large
However, PASTA often under-aligns (and so produces alignments
that are longer than they
Try to modify the output of PASTA so that you reduce the under-alignment
tendency, perhaps by considering the set of alignment/tree pairs it produces.
- In "Benchmarking statistical multiple sequence alignment"
(bioRxiv, doi: https://doi.org/10.1101/304659), my students and I
showed a strange trend in which
BAli-Phy did very well on simulated
datasets but not very well
on biological amino acid datasets.
Try to do this evaluation on nucleotide datasets.
- Compare standard maximum likelihood methods
(that treat gaps as missing data) to BAli-Phy (which
treats gaps as insertions and deletions, under a statistical
model) on simulated data, given
an alignment (e.g.,
either the true alignment or an estimated alignment).
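Several of the alignment projects above need a way to score an estimated alignment against a reference (true) alignment. The following is a minimal Python sketch, not taken from any of the tools named above, of two of the criteria mentioned (SPFN and SPFP) plus a simple length ratio for spotting over- or under-alignment. It assumes each alignment is a list of equal-length gapped strings with '-' as the gap character and rows in the same sequence order; the function names are hypothetical.

```python
from itertools import combinations


def homology_pairs(alignment):
    """Return the set of residue pairs that share a column.

    Each residue is identified as (row index, residue index within that row),
    so the pairs can be compared between two alignments of the same sequences.
    """
    next_residue = [0] * len(alignment)  # next ungapped position in each row
    pairs = set()
    for col in range(len(alignment[0])):
        residues = []
        for row, seq in enumerate(alignment):
            if seq[col] != "-":
                residues.append((row, next_residue[row]))
                next_residue[row] += 1
        # Residues are collected in row order, so each pair has a fixed orientation.
        pairs.update(combinations(residues, 2))
    return pairs


def spfn_spfp(reference, estimated):
    """SPFN: fraction of reference pairs missing from the estimate.
    SPFP: fraction of estimated pairs that are not in the reference."""
    ref = homology_pairs(reference)
    est = homology_pairs(estimated)
    spfn = len(ref - est) / len(ref) if ref else 0.0
    spfp = len(est - ref) / len(est) if est else 0.0
    return spfn, spfp


def length_ratio(reference, estimated):
    """Estimated length divided by reference length.

    Below 1 suggests over-alignment (estimate shorter than the reference);
    above 1 suggests under-alignment (estimate longer than the reference).
    """
    return len(estimated[0]) / len(reference[0])


if __name__ == "__main__":
    reference = ["AC-GT", "ACTGT"]
    estimated = ["ACG-T", "ACTGT"]
    print(spfn_spfp(reference, estimated))
    print(length_ratio(reference, estimated))
```

Storing every homologous pair takes memory quadratic in the number of sequences per column, so this sketch is only suitable for small test cases; for large datasets an existing, tested scoring tool is the safer choice.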
Projects related to supertree estimation or phylogenomic species tree estimation
- Develop a good method for weighted quartet tree amalgamation.
Compare to Weighted Quartets MaxCut by Avni et al. (2014),
- Evaluate Weighted Quartets MaxCut (Avni et al. 2014)
as a supertree method on the SMIDgen datasets (Swenson et al.).
- Implement a parallel version of some good supertree method, such as
FastRFS (Vachaspati and Warnow),
Quartets MaxCut (Snir and Rao), etc.
- Quartet-based supertree methods have computational limitations
in that they depend on computing all quartet trees.
Evaluate variants that
only require computing a subset of the quartet trees.
- Evaluate the impact of multiple sequence alignment error on
SVDquartets (Chifman and Kubatko, as implemented in PAUP*).
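To see why computing all quartet trees becomes a bottleneck, note that a dataset with n taxa has n choose 4 four-taxon subsets. The following is a minimal Python sketch that counts them and draws a random subset of quartets to work with instead. It is only an illustration of the subsampling idea: the function names are hypothetical, uniform random sampling is one assumed strategy among many, and this is not how any of the methods named above select quartets.

```python
import math
import random


def num_quartets(n_taxa):
    """Number of distinct 4-taxon subsets of n_taxa taxa."""
    return math.comb(n_taxa, 4)


def sample_quartets(taxa, k, seed=0):
    """Draw k distinct 4-taxon subsets uniformly at random.

    Uses rejection sampling, so the full list of quartets is never enumerated.
    This is efficient when k is much smaller than the total number of quartets.
    """
    taxa = sorted(taxa)
    if k > num_quartets(len(taxa)):
        raise ValueError("k exceeds the number of available quartets")
    rng = random.Random(seed)
    chosen = set()
    while len(chosen) < k:
        chosen.add(tuple(sorted(rng.sample(taxa, 4))))
    return sorted(chosen)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, num_quartets(n))
    taxa = ["t%d" % i for i in range(10)]
    for quartet in sample_quartets(taxa, 5):
        print(quartet)
```

Each sampled quartet would then be passed to whatever quartet tree estimator the project uses, and the resulting quartet trees given to the amalgamation method.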
Projects related to maximum likelihood gene tree estimation
- Develop a parallel implementation of FastTree-2.
- Evaluate the impact of the starting tree on FastTree-2.
- ML methods have been compared in terms of
tree topology accuracy and ML score, and in general RAxML has been
found to be better than other methods for ML score but only rarely
better than FastTree-2 in terms of tree topology.
Try to characterize the conditions in which RAxML is generally
more accurate than FastTree-2 (or other ML methods) for tree topology.
- Test some leading maximum likelihood methods on large datasets
(be careful - this will require a lot of computing time).
- Test some leading maximum likelihood method (e.g., RAxML) on
simulated datasets that evolved with heterotachy, and compare to maximum
Note: I don't know if any simulator exists that evolves
with heterotachy; if not, then a simpler project is to
create such a simulator.
- Modify some maximum likelihood method to take support values
per site into account.
- Develop a method that can compute a tree from an incomplete distance
matrix (i.e., a matrix where some of the entries
do not have
Compare to the "NJ*" methods from
Criscuolo and Gascuel (2008),
on some simulated datasets.
- Find an application area where tree estimation (especially of
large datasets) is important and not necessarily solved well,
and where accuracy can be measured.
See if you can develop a technique for computing trees in this
context where the DCM-boosting approaches (see Chapter 11)
People who can help
- Mike Nute, PhD student in the Warnow lab. Mike can help with
any project related to multiple sequence alignment.
Contact him at email@example.com.
- Erin Molloy, PhD student in the Warnow lab.
Erin can help with projects related to maximum likelihood gene tree
estimation.
Contact Erin at firstname.lastname@example.org.
- Pranjal Vachaspati, PhD student in the Warnow lab.
Pranjal can help with projects related to phylogenomics and supertree methods,
and with methods for computing trees from incomplete distance matrices.
Contact Pranjal at email@example.com.