CS/BioE 589AGB final projects and presentations

There are two types of final projects you can do: a survey paper or a research paper. A survey paper must be done by yourself, but a research paper can be done with one other person. When the final project involves two people, each person must be equally involved in the project and be able to answer questions about it. You have a lot of freedom in what you do, and can pick something on your own; this document should help you think of things you might want to do. The class presentation you do (between March 31 and April 14) needs to be closely connected to your final project, so you should have a pretty clear plan for the final project when you pick the paper for your class presentation. While you are thinking about the general area for a final project, you might look at this.

Research Project

If you want to do a research project, please come see me to discuss the possible projects. Datasets for studying methods can be obtained at my old lab webpage, as well as from other projects. Here are some simple examples of research projects:

Biological dataset analysis

The projects in this category involve using a biological dataset (possibly one that has already been published and studied) and analyzing it using a number of different pipelines, in order to understand how method choice impacts discovery.

For example, in the context of gene tree estimation, you could vary the choice of multiple sequence alignment method and phylogeny estimation method, and you could also look at co-estimation methods (that estimate alignments and trees together, such as BAli-Phy), or even alignment-free estimation.
In the context of multi-locus species tree estimation, you could vary the methods used to estimate species trees (co-estimation of gene trees and species trees, summary methods, or single site methods). In each case, look at how changing your input data impacts the analysis. For example, how does taxon sampling impact the result? How does removing or masking noisy sites within alignments (using various techniques, such as GBLOCKS) impact gene tree estimation? For species tree estimation from multi-locus data using summary methods, how does deleting loci with poorly supported gene trees, or collapsing low support branches in gene trees, impact the final tree? Focus on understanding how the choice of method and the properties of the data impact the final biological discoveries. This particular project type is a natural outcome of Problem Set 3 from the midterm, but you'd probably explore more variations in the estimation process, and you'd also explore the impact of data modification on the analysis.
You might also be interested in how the choice of gene tree estimation procedure (which includes the alignment estimation and tree estimation steps) impacts the detection of selection. It has already been observed that over-alignment (where sites contain non-homologous nucleotides) can result in the false positive detection of positive selection; however, as far as I know, the impact of under-alignment has not been studied. And it is not clear how the phylogeny estimation step impacts this question. Similarly, the impact of the gene tree estimation or species tree estimation procedure on other biological questions (e.g., predicting function or structure in proteins) has not been very well investigated.
Explore the impact of taxon identification method and dataset choice on microbiome analysis. For example, you could use different taxonomic profiling methods (e.g., KRAKEN and TIPP), different data (16S only, or metagenomic data using whole genome shotgun sequences), or different sequencing technologies (Illumina, PacBio, or other longer read sequencing technologies).
Some statistical methods for multiple sequence alignment (e.g., BAli-Phy) seem to perform very well on simulated data, but we don't know how well they perform on biological datasets. Take one of these methods (e.g., BAli-Phy, or perhaps PAGAN or Prank, but there are others) and compare it to leading alignment methods (MAFFT, PASTA, etc.) on biological datasets with benchmark alignments based on structure. Many of these methods (e.g., BAli-Phy) use MCMC and are computationally intensive; hence, this should be limited to small datasets (at most 25 sequences). You can subsample from larger datasets with structural alignments to produce these smaller datasets.
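
One of the data modifications mentioned above is collapsing low support branches in gene trees before running a summary method. The following is a minimal sketch of that step, not a reference implementation. The `Node` class, the function names, and the threshold of 75 are all hypothetical choices made for illustration; in practice you would read trees with support values from Newick files using a phylogenetics library.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    # Hypothetical minimal tree node used only for this sketch.
    name: Optional[str] = None        # leaf label; None for internal nodes
    support: Optional[float] = None   # support for the branch above this node
    children: List["Node"] = field(default_factory=list)


def collapse_low_support(node: Node, threshold: float) -> Node:
    """Return a copy of the tree with every internal branch whose support
    is below `threshold` contracted (its children are attached to its parent)."""
    new_children = []
    for child in node.children:
        collapsed = collapse_low_support(child, threshold)
        is_internal = bool(collapsed.children)
        low = collapsed.support is not None and collapsed.support < threshold
        if is_internal and low:
            # Contract the branch: promote the grandchildren.
            new_children.extend(collapsed.children)
        else:
            new_children.append(collapsed)
    return Node(node.name, node.support, new_children)


def to_newick(node: Node) -> str:
    """Topology-only Newick string (no branch lengths or support values)."""
    if not node.children:
        return node.name
    return "(" + ",".join(to_newick(c) for c in node.children) + ")"


# Example: the branch above (C,D) has support 40 and is collapsed at threshold 75.
tree = Node(children=[
    Node(support=95, children=[Node("A"), Node("B")]),
    Node(support=40, children=[Node("C"), Node("D")]),
    Node("E"),
])
print(to_newick(collapse_low_support(tree, 75)))
```

Collapsing produces polytomies, so check that the summary method you use accepts unresolved gene trees.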

Exploring methods on simulated data

With biological datasets, one rarely really knows the true phylogeny or multiple sequence alignment, and so evaluating the impact of method choice on final biological discovery is complicated. If you wish, therefore, you can use simulated datasets to explore performance of methods. Many papers provide links to published simulated datasets that you can use. Examples of questions you might address on simulated data include:

Evaluating the impact of missing data on phylogeny estimation. For example, if you have the true alignment of a set of sequences, but you now delete a large fraction of the sites in a single sequence x, how does this impact phylogeny estimation? In particular, does it impact the accuracy of the tree topology? Does it impact the branch length estimation, in particular of the branch leading to the leaf for x?
Determining if alignment-free phylogeny estimation methods can be as accurate as good phylogeny estimation methods (e.g., two-phase methods, or PASTA). If so, under what conditions?
Determining how different MSA methods and tree estimation methods impact the estimation of parameters of the model tree beyond the topology -- such as branch lengths and the GTR matrix.
In PNAS 2013, Bouchard-Côté and Jordan presented a new method for co-estimating multiple sequence alignments and trees; however, that method was not studied in comparison to other co-estimation methods nor to good two-phase methods for estimating trees from unaligned sequences. Therefore, a comparison of their method to good alternatives would help us evaluate whether their method is competitive.
Nearly all studies have explored accuracy under sequence evolution models that only include substitutions and simple indels; yet evolution also includes other events, such as tandem duplications and rearrangements. Hence we do not know how well methods perform under more realistic models. Find a sequence evolution simulator that includes these more complicated processes, and evaluate methods for estimating gene trees (either two-phase methods or co-estimation methods) from unaligned sequences on data generated by these simulators.
Phylogeny estimation often is based on a single point estimate of the multiple sequence alignment, and multiple sequence alignments impact the accuracy of the phylogeny. See if you can find ways to improve the estimation of the multiple sequence alignment by combining MSAs. Some methods, such as T-Coffee, are designed for this. Explore their performance in terms of improving the MSA estimation, and the impact that new MSA has on the phylogeny. This can be done on both biological and simulated data.
Explore the impact of alignment masking (as in GBLOCKS and similar methods) on multiple sequence alignment methods like Prank, Pagan, and UPP that have a tendency to under-align. This can be done on both biological and simulated data.
Explore the impact of including rogue taxa in a dataset by adding random sequences to your input sequence dataset. Concretely, suppose you have a set S of homologous sequences, and you add a random sequence x to the dataset, creating a larger set S'. Construct an ML tree T on S and then also construct an ML tree T' on S'. Now delete x from T', so that it is a tree on S. How similar are the two trees? What happens if you have more than a single non-homologous sequence? In other words, does the inclusion of random sequence data impact the phylogeny estimation? This is likely to depend on the methods you use, so consider different ways of computing alignments (e.g., MAFFT, Muscle, Clustal, and UPP) and trees (e.g., Neighbor Joining and ML).
Can we detect non-homologous sequences in datasets? Even if the inclusion of non-homologs does not impact phylogeny estimation (and it might!), the inclusion of non-homologs in a phylogeny is at a minimum misleading. Can we detect these non-homologs and delete them? Consider the use of phylogenies (finding long branches) and multiple sequence alignment methods (e.g., UPP) for this purpose.
We would like to know if phylogeny estimation methods (such as maximum likelihood, neighbor joining, etc.) are biased in terms of topological shape. For example, some methods may tend to make trees that are imbalanced, and perhaps others will tend to make trees that are balanced. There are methods for measuring how balanced a tree is, which can be used to test methods for being biased. Imagine you generate a model gene tree topology and calculate the measure of balance. Then you simulate evolution down the tree and estimate a gene tree from the sequences you simulate; this estimated tree also has a measure of balance. By changing how you compute gene trees (e.g., neighbor joining, maximum likelihood, etc.), you can assess whether the method is biased towards some kind of topological shape. Measures to consider include the Colless measure of tree balance, and the beta-splitting model of Aldous.
- Write code to compute the Colless measure of a given rooted gene tree.
- Write code to compute the beta parameter for a given rooted gene tree.
Explore the impact of the rate of evolution on being able to estimate large trees. You should do a simulation study with indels and substitutions, and then systematically scale the tree up and down, and explore what happens with a poor alignment method, a good alignment method, and the true alignment. See if there is a "sweet spot", and characterize the empirical statistics of the range in which the results are optimized.
Find a simulator for a sequence evolution model such as the General Markov Model (which contains the GTR model), or some other model more complex than GTR. Explore the accuracy of tree estimation methods under this more complex model (e.g., maximum parsimony, neighbor joining, and maximum likelihood under simpler models). In other words, simulate under a more complex model, and then estimate under the simpler model. (You can also approximate this by simulating sequences under different GTR parameters but the same model tree topology, and concatenating the alignments; it wouldn't be the same way of exploring robustness, but it would be getting at a similar question.)
Test different tree estimation methods (such as FastTree, RAxML, neighbor joining, and maximum parsimony) on datasets with fragmentary sequences, to determine whether the methods behave differently. Things to evaluate: tree topology and branch length estimation.
Evaluate the impact of correcting distances or not correcting distances on phylogeny estimation. Be sure to include datasets with different rates of evolution.
Find a simulator that evolves gene trees within a species tree under a duplication and loss scenario, and test methods for computing species trees from gene trees on datasets you generate. You can consider many types of methods, as long as they can handle multiple copies of species inside each gene tree; examples of such methods include MulRF, DupTree, and iGTP.
Evaluate the impact of "missing data" on species tree estimation methods, i.e., methods that combine estimated gene trees into a species tree. Here the missing data occur when not all of the given gene trees contain all the species.
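
For the tree-shape bias project above, the Colless measure of a rooted binary tree is the sum, over all internal nodes, of the absolute difference between the number of leaves in the left and right subtrees. The following is a minimal sketch under the assumption that trees are given as nested 2-tuples with strings as leaves; this representation and the function name are hypothetical, and a real project would read Newick files with a phylogenetics library.

```python
def colless_index(tree):
    """Colless imbalance of a rooted binary tree given as nested 2-tuples.

    Leaves are strings; every internal node must have exactly two children.
    """
    def visit(node):
        # Returns (number of leaves below node, Colless sum within the subtree).
        if isinstance(node, str):
            return 1, 0
        if len(node) != 2:
            raise ValueError("the Colless measure is defined here for binary trees only")
        left_n, left_c = visit(node[0])
        right_n, right_c = visit(node[1])
        return left_n + right_n, left_c + right_c + abs(left_n - right_n)

    return visit(tree)[1]


balanced = (("A", "B"), ("C", "D"))
caterpillar = ((("A", "B"), "C"), "D")
print(colless_index(balanced))     # 0: every node splits its leaves evenly
print(colless_index(caterpillar))  # 3: the maximum for four leaves
```

The recursion depth grows with tree height, so very large caterpillar-like trees may need an iterative traversal. Estimated trees are often unrooted, so you also need to decide how to root them before computing the measure.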

New method development

Multiple sequence alignment

Imagine the following divide-and-conquer style of multiple sequence alignment. The input is a set S of unaligned sequences. 1. Divide into two parts (somehow!). 2. Align each part using your preferred MSA method. 3. Build a profile Hidden Markov Model on each of the two MSAs you compute. 4. Align the two profile HMMs. Compare the result you get to what you would get by using your preferred MSA method on the full dataset. (Note, this is a very under-specified method - so you'd need to explore the design space.)
See if you can develop improved ways of combining multiple sequence alignments to get a better (more accurate) sequence alignment. For inputs, you can use methods like PASTA and SATé that produce many multiple sequence alignments for a given input of unaligned sequences, but you can also use any multiple sequence alignment method. Compare your method to techniques like T-Coffee that are designed for this. Explore the performance in terms of improving MSA estimation, and the impact that new MSA has on the phylogeny. This can be done on both biological datasets (that have structural alignments) and simulated datasets.
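
Both projects above require comparing an estimated alignment to a reference (true or structural) alignment. One common accuracy measure is the SP-score: the fraction of homologous residue pairs in the reference alignment that also appear in the estimated alignment. The following is a minimal sketch, assuming both alignments are lists of gapped strings that contain the same sequences in the same order, with "-" as the gap character. The function names are hypothetical.

```python
from itertools import combinations


def homologous_pairs(alignment):
    """Set of aligned residue pairs ((row_i, pos_i), (row_j, pos_j)).

    pos is the index of the residue in the ungapped sequence, so the pairs
    are comparable across different alignments of the same sequences.
    """
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows of an alignment must have the same length")
    next_pos = [0] * len(alignment)
    pairs = set()
    n_cols = len(alignment[0]) if alignment else 0
    for col in range(n_cols):
        present = []
        for i, row in enumerate(alignment):
            if row[col] != "-":
                present.append((i, next_pos[i]))
                next_pos[i] += 1
        pairs.update(combinations(present, 2))
    return pairs


def sp_score(estimated, reference):
    """Fraction of reference homologies recovered by the estimated alignment."""
    ref = homologous_pairs(reference)
    if not ref:
        raise ValueError("reference alignment contains no homologous pairs")
    return len(homologous_pairs(estimated) & ref) / len(ref)


reference = ["ACGT", "A-GT"]
estimated = ["ACGT", "AG-T"]
print(sp_score(estimated, reference))  # 2 of the 3 reference pairs are recovered
```

This stores every residue pair explicitly, which is fine for small benchmark datasets but too memory-hungry for alignments with thousands of sequences.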

Gene tree estimation

The estimation of very large trees (with more than 10,000 sequences) is almost always done through standard two-phase methods: first align, then compute a ML tree on the alignment; even PASTA and SATé compute ML trees on the alignments they compute in each iteration. Yet this approach may not be scalable to large datasets. Can we improve this through divide-and-conquer? Suppose we simplify the problem and assume we have a multiple sequence alignment: can we develop a fast way of computing trees from the alignment that is (hopefully) as accurate as running FastTree-2 or RAxML on the alignment? Consider the following divide-and-conquer style of tree estimation. The input is a set S of sequences in an alignment. 1. Divide into two overlapping parts (somehow!). 2. Construct a tree on each part using your preferred method. 3. Merge the two trees into a supertree, using a preferred supertree method. Note this is a very under-specified method, so you'd need to explore the design space. The objective here is to have good accuracy on very large datasets. Your exploration of this should examine the largest datasets you can, but this is clearly going to be impacted by the computational infrastructure you have available. However, you should definitely compare the tree you get to what you would get using FastTree-2 on the alignment, since FastTree-2 is a very fast and relatively accurate ML method.
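
The three steps above can be sketched as a small driver with the design choices left as plug-in functions. This is only a skeleton under stated assumptions: the alignment is a dictionary from sequence name to aligned sequence, `build_tree` and `merge_trees` are hypothetical callables that you supply (for example, wrappers around a tree estimation program and a supertree program), and the split shown here is a deliberately naive placeholder based on name order. A real design would more likely choose the two parts using a guide tree or a distance-based clustering.

```python
def split_overlapping(names, overlap):
    """Split names into two subsets that share exactly `overlap` names.

    Placeholder strategy: the first `overlap` names go to both subsets and
    the remaining names are divided in half.
    """
    names = list(names)
    if not 0 <= overlap <= len(names):
        raise ValueError("overlap must be between 0 and the number of sequences")
    shared, rest = names[:overlap], names[overlap:]
    half = (len(rest) + 1) // 2
    return shared + rest[:half], shared + rest[half:]


def divide_and_conquer_tree(alignment, build_tree, merge_trees, overlap):
    """Steps 1-3: divide, build a tree on each part, merge the two trees."""
    part_a, part_b = split_overlapping(sorted(alignment), overlap)
    tree_a = build_tree({name: alignment[name] for name in part_a})
    tree_b = build_tree({name: alignment[name] for name in part_b})
    return merge_trees(tree_a, tree_b)


part_a, part_b = split_overlapping(list("ABCDEFG"), overlap=3)
print(part_a)  # ['A', 'B', 'C', 'D', 'E']
print(part_b)  # ['A', 'B', 'C', 'F', 'G']
```

The amount of overlap is one of the parameters worth exploring, since the supertree method can only relate the two subset trees through the taxa they share.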

Phylogenomics

ASTRAL is designed for combining unrooted gene trees into an unrooted species tree using quartets. Modify ASTRAL to work with rooted gene trees, and test it.
ASTRID is a modification of NJst that is faster and can handle missing data. However, it is also different in that it uses FastME to compute species trees instead of NJ. See what happens if you replace FastME with other distance-based methods (e.g., the distance-based method FastTree). See what happens if you modify ASTRID so that it is based on a different way of calculating the distance matrix.
The question here is whether we can use species trees (either known or estimated) to improve the estimation of gene trees. This is a general topic of great interest, but the techniques depend on the causes for gene tree incongruence with the species tree (e.g., duplication/loss scenarios, incomplete lineage sorting, etc.). Find methods that "correct" gene trees using species trees, and evaluate how well they work. (Consider true vs. estimated species trees, and also gene trees that are estimated with low to high error.)
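
For the ASTRID project above, it helps to have a baseline distance matrix to modify. The following is a minimal sketch of one such matrix: for each pair of species, the topological path length (number of edges) between them, averaged over the gene trees that contain both. This is meant to be in the spirit of the internode distances used by NJst and ASTRID, but the exact definition should be checked against the papers before relying on it. Trees are assumed to be nested tuples with strings as leaves, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def leaf_paths(tree):
    """Map each leaf to the tuple of internal-node ids from the root down to its parent."""
    paths = {}
    counter = [0]

    def visit(node, path):
        if isinstance(node, str):
            paths[node] = path
            return
        node_id = counter[0]
        counter[0] += 1
        for child in node:
            visit(child, path + (node_id,))

    visit(tree, ())
    return paths


def path_edges(path_a, path_b):
    """Number of edges on the path between two leaves, given their root paths."""
    shared = 0
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        shared += 1
    # Each leaf climbs from its own depth up to the deepest shared ancestor.
    return (len(path_a) - shared + 1) + (len(path_b) - shared + 1)


def average_distances(gene_trees):
    """Average path length per species pair, over gene trees containing both."""
    totals, counts = defaultdict(int), defaultdict(int)
    for tree in gene_trees:
        paths = leaf_paths(tree)
        for a, b in combinations(sorted(paths), 2):
            totals[(a, b)] += path_edges(paths[a], paths[b])
            counts[(a, b)] += 1
    return {pair: totals[pair] / counts[pair] for pair in totals}


# Unrooted trees are written with a multifurcation at the top level.
trees = [("A", "B", ("C", "D")), ("A", "C", ("B", "D"))]
print(average_distances(trees)[("A", "B")])  # 2.5
```

Pairs of species that never appear together in any gene tree are simply absent from the result, and how to handle those missing entries is itself a design decision. If a tree is written with a binary root, the root counts as a node on the path, so either write unrooted trees with a top-level multifurcation as above or adjust the count.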

Other

I hypothesize that maximum likelihood bootstrap gene trees are less accurate estimates of the true gene tree than the best maximum likelihood tree for the gene sequence alignment. Run an experiment to test this. Visualize the results with MDS. Do the same thing with MrBayes, using the sample from the distribution produced by MrBayes. What percentage of the sample is closer to the true tree, the same distance, or further? Does this depend on the model tree properties and sequence length?
Compare visualization tools for large trees. What is each tool good for?
Compare visualization tools for large multiple sequence alignments. What is each tool good for?
We would like to have visualization tools that can compare two large trees, and identify places in the tree where they are different. Do such tools exist? If so, find them and evaluate how well they work.
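
For the bootstrap experiment above, the tally of closer, same, and further trees can be sketched as follows. This is a minimal sketch that uses the Robinson-Foulds (RF) distance on unrooted topologies as the measure of distance to the true tree; the choice of RF is an assumption, and other tree distances could be substituted. Trees are assumed to be nested tuples with strings as leaves, all on the same leaf set, and the function names are hypothetical. The MDS visualization is not shown.

```python
def bipartitions(tree):
    """Non-trivial bipartitions of a tree, each stored as the side that
    excludes a fixed reference leaf (the alphabetically smallest one)."""
    clusters = []

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(visit(child) for child in node))
        clusters.append(leaves)
        return leaves

    all_leaves = visit(tree)
    ref = min(all_leaves)
    splits = set()
    for cluster in clusters:
        side = all_leaves - cluster if ref in cluster else cluster
        # Keep only splits with at least two leaves on each side.
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
    return splits


def rf_distance(tree_1, tree_2):
    """Number of bipartitions found in exactly one of the two trees."""
    return len(bipartitions(tree_1) ^ bipartitions(tree_2))


def compare_to_best(true_tree, best_tree, sample_trees):
    """Percentage of sampled trees closer to, as far from, or further from
    the true tree than the best ML tree is."""
    if not sample_trees:
        raise ValueError("sample_trees must not be empty")
    best = rf_distance(best_tree, true_tree)
    tally = {"closer": 0, "same": 0, "further": 0}
    for tree in sample_trees:
        d = rf_distance(tree, true_tree)
        key = "closer" if d < best else "same" if d == best else "further"
        tally[key] += 1
    return {k: 100 * v / len(sample_trees) for k, v in tally.items()}


true_tree = (("A", "B"), ("C", ("D", "E")))
best_tree = (("A", "B"), ("D", ("C", "E")))
sample = [true_tree, best_tree, (("A", "C"), ("B", ("D", "E"))), (("A", "D"), ("B", ("C", "E")))]
print(compare_to_best(true_tree, best_tree, sample))
```

The same function applies unchanged to a MrBayes posterior sample in place of the bootstrap replicates.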

Survey Paper

Writing a good survey paper is not trivial. You will need to understand the papers you are reading and have some insights into the different contributions made by different papers. The quality of your writing is very important, and you should think of this as something that you would be willing to submit to a journal in the form that you submit it for a grade. That means, among other things, no typos, no grammatical mistakes, a proper bibliography (with full bibliographical information), and thoughtful exposition. Also hand in a hard copy of the main papers you reference. Be careful, of course, not to include any text from any other paper, unless you put quotes around it and properly attribute it.

When you write a survey paper, you need to specifically identify the question you are interested in, and why it is interesting and important. You should explain controversies (if any), the leading approaches, and the evidence for or against each approach. You need, as always, to really be critical - not necessarily just accepting what the authors say, but pointing out limitations of their approach. Examples of possible topics for a survey paper include:

Ultra-fast methods for distance-based phylogeny estimation
Alignment-free tree estimation methods
Methods for detecting horizontal gene transfer (HGT) or constructing species trees in the presence of HGT
Methods for estimating species trees from gene trees when gene trees can differ due to incomplete lineage sorting
Methods for estimating species trees from gene trees when gene trees can differ due to duplication and loss
Models of evolution that are more complex than GTR, and so allow (for example) for dependencies between sites
Techniques for dating ancestral nodes
Techniques for inferring ancestral sequences
Genome-scale multiple alignment methods (taking rearrangements into account)
Genome rearrangement phylogeny
Methods for detecting remote homology
Methods for masking noisy sites in multiple sequence alignments
Methods for combining information from a collection of multiple sequence alignments