Homework for undergrads learning about CS 581
Assignments
 Homework 1.
 Read Chapters 1-3, 5.1-5.5.
 All review questions in Chapters 1, 2, and 3.
 HW problems:
Chapter 1 problems 1, 7, 8
Chapter 2 problems 1, 5, 21

Homework 2.
 Read Chapters 4, 5.6-5.11, 8.1-8.2, 8.4-8.5, 8.8
 All review questions in Chapters 4 and 5.
 HW problems:
Chapter 3 problems 3 and 22.
Chapter 4 problems 1 and 5

Homework 3.

Read: Chapter 8.6, 8.7, 8.9-8.13
 All review questions in Chapter 8.

HW problems:
Chapter 5 problems 9, 16, and 18.
Chapter 8 problems 1, 2, 11, and 12.

Homework 4.

Homework 5.

Read Chapter 9.6-9.16 and Appendix C.

All review questions for Chapter 9.

Read the following papers.
Pick two of the papers,
write a brief summary of each (2 paragraphs, maximum),
and come to class ready to discuss them.
In particular, be prepared to pose two questions or critiques.
Each of these can be a request for clarification,
a critique of the paper (of the methodology or conclusions), or
a suggestion for follow-up research.
Pay close attention to the methodology used to evaluate the methods that are presented.

"Who watches the watchmen?", by Iantorno et al. (2014), DOI: 10.1007/978-1-62703-646-7_4

"Phylogeny-Aware Gap Placement Prevents Errors in Sequence Alignment and Evolutionary Analysis", by Löytynoja and Goldman (2008), DOI: 10.1126/science.1158395

"Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega", by Sievers et al. (2011), DOI: 10.1038/msb.2011.75

"M-Coffee: combining multiple sequence alignment methods with T-Coffee", by Wallace et al. (2006), DOI: 10.1093/nar/gkl091

Homework 6.

Read Chapters 6.1-6.2, 7.1-7.9, 10.1-10.4.

Homework problems: Chapter 6 problems 4 and 8

Homework 7.
 All review questions for Chapter 6 and 7.
 Read Chapter 10.5-10.6.
 Let T be a binary tree and let D(x,y) denote
the number of internal nodes on the path between leaves x and y in T.
Prove that the matrix D is additive for T.
(Note: this is related to the internode distance matrix
used in the NJst and ASTRID species tree estimation methods.)
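
Before attempting the proof, it can help to check the claim numerically on a small example. The sketch below is not a proof and is not part of the assignment; it assumes the tree is unrooted and given as an adjacency dictionary (a hypothetical representation chosen for illustration), computes D for all pairs of leaves, and tests the four-point condition, which is one standard characterization of additive matrices.

```python
from collections import deque
from itertools import combinations


def internode_distances(adj):
    """Return (leaves, D), where D[x, y] is the number of internal nodes
    on the path between leaves x and y. Assumes adj describes a tree."""
    leaves = sorted(n for n, nbrs in adj.items() if len(nbrs) == 1)
    dist = {}
    for src in leaves:
        # Breadth-first search gives the number of edges from src to every node.
        edges = {src: 0}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in edges:
                    edges[nxt] = edges[node] + 1
                    queue.append(nxt)
        for dst in leaves:
            # A leaf-to-leaf path with k edges passes through k - 1 internal nodes.
            dist[src, dst] = 0 if src == dst else edges[dst] - 1
    return leaves, dist


def satisfies_four_point(leaves, dist):
    """Four-point condition: for every quartet, the two largest of the
    three pairwise sums are equal."""
    for a, b, c, d in combinations(leaves, 4):
        sums = sorted([
            dist[a, b] + dist[c, d],
            dist[a, c] + dist[b, d],
            dist[a, d] + dist[b, c],
        ])
        if sums[1] != sums[2]:
            return False
    return True


# Example tree (illustrative only): ((a,b),c,(d,e)) with internal nodes u, v, w.
example_tree = {
    "a": ["u"], "b": ["u"], "c": ["v"], "d": ["w"], "e": ["w"],
    "u": ["a", "b", "v"], "v": ["u", "c", "w"], "w": ["v", "d", "e"],
}
example_leaves, example_D = internode_distances(example_tree)
example_ok = satisfies_four_point(example_leaves, example_D)
```

On this example, D(a,b) = 1 and D(a,d) = 3, and every quartet passes the four-point test. A passing check on examples only suggests the statement is plausible; the homework still asks for a proof that holds for every binary tree.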

Homework 8.
 All review questions for Chapter 10.
 HW problems: Chapter 7, problem 2. Chapter 10, problems 1 and 2.

Homework 9.
Do one of the following:
 Develop a clustering algorithm (see below for
guidelines), implement it, use it to analyze
one or more datasets (as described below), and write up a
report on the algorithm and what you observed on the data.
Please note that your code must be made available to the T.A.
so that it can be tested on additional datasets.
Hence, you are required to submit your report on Moodle and
your commented code by email.

The input will be a set of 1000 to 10,000 unaligned DNA sequences,
drawn from one of the simulated datasets (ROSE, Indelible, or RNASim)
studied in the SATé and PASTA papers.
Please see https://sites.google.com/eng.ucsd.edu/datasets/pastaupp for RNASim and Indelible, and
https://sites.google.com/eng.ucsd.edu/datasets/satei for ROSE datasets.

The output will be a collection of clusters (disjoint or non-disjoint,
depending on the purpose of the clustering).
Each cluster should have at least 100 sequences and at most
10% of the input sequences.
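
The size requirements above can be checked mechanically before submitting. The following is a minimal sketch, assuming clusters are represented as lists of sequence names; the function name check_clusters and its parameters are hypothetical and not part of the assignment.

```python
def check_clusters(clusters, n_sequences, min_size=100, max_percent=10):
    """Return (problems, is_disjoint) for a proposed clustering.

    clusters     -- list of clusters, each an iterable of sequence names
    n_sequences  -- number of sequences in the full input
    """
    # Integer arithmetic avoids floating-point rounding at the boundary.
    max_size = n_sequences * max_percent // 100
    problems = []
    for i, cluster in enumerate(clusters):
        size = len(set(cluster))
        if size < min_size:
            problems.append(f"cluster {i}: {size} sequences, fewer than {min_size}")
        if size > max_size:
            problems.append(f"cluster {i}: {size} sequences, more than {max_size}")
    # The clusters are disjoint exactly when no name is counted twice.
    total = sum(len(set(c)) for c in clusters)
    distinct = len(set().union(*clusters)) if clusters else 0
    return problems, total == distinct


# Illustrative data only: ten disjoint clusters of 100 names each.
good = [[f"s{i}" for i in range(k * 100, (k + 1) * 100)] for k in range(10)]
good_problems, good_disjoint = check_clusters(good, 1000)
bad_problems, bad_disjoint = check_clusters([[f"s{i}" for i in range(50)]], 1000)
```

Note that for an input of exactly 1000 sequences the two bounds coincide: every cluster must contain exactly 100 sequences.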
 Constraints:
You are not allowed to compute a multiple sequence alignment
on the full dataset (although you are allowed to compute
alignments on small subsets, if you wish). You should
not use PASTA, SATé, or any other existing divide-and-conquer
approach in your algorithm.

You should run your code
on one replicate
of the ROSE 1000M1 datasets (from the SATé paper) and make
sure it runs on your own
laptop or in some other low memory environment.
You may also want to run your code on one replicate of the
Indelible 10,000M1 datasets (from the PASTA paper).
Report the running time and memory usage, and return the output of your code
(the clustering of the dataset into subsets).
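
One possible way to collect the running time and memory figures, if the clustering code is written in Python, is sketched below. The helper name measure is hypothetical. It uses the standard-library modules time and tracemalloc; note that tracemalloc only tracks memory allocated by Python itself, so memory used by external programs you call would need to be measured separately (for example with the operating system's own tools).

```python
import time
import tracemalloc


def measure(func, *args, **kwargs):
    """Run func(*args, **kwargs) and return (result, seconds, peak_bytes).

    peak_bytes is the peak Python-level allocation seen by tracemalloc
    while func was running, not the total memory of the process.
    """
    tracemalloc.start()
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak


# Illustrative call on a stand-in function; replace sorted with your
# own clustering function and its input.
demo_result, demo_seconds, demo_peak = measure(sorted, [3, 1, 2])
```

Tracing allocations slows the code down somewhat, so for the final timing figure it may be fairer to time a separate run without tracemalloc enabled.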

Note: there are two purposes for this clustering: multiple
sequence alignment (which needs disjoint clusters) and
tree estimation (which needs overlapping clusters).
So you might also want to follow through on your
algorithm design by seeing how well your clusters work for
a divide-and-conquer MSA estimation or a divide-and-conquer
tree estimation protocol.

Take one published biological dataset where the published phylogeny estimated
on the dataset was produced using
outdated methods, and reanalyze it using newer methods.
Compare the trees you obtain and discuss them in a written paper.
(Note, your paper should be written as though you were going to submit
it for publication in a journal.)

Take an unpublished biological dataset and construct a phylogeny
on it using
at least two good protocols.
Compare the trees you obtain and discuss them in a written paper.
(Note, your paper should be written as though you were going to submit
it for publication in a journal.)
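
For the tree comparisons in the last two options, one common quantitative summary is the Robinson-Foulds (RF) distance, which counts the bipartitions found in one tree but not the other. The assignment does not prescribe any particular comparison measure; the sketch below is only an illustration, and it assumes both trees are unrooted, have the same leaf set, and are given as adjacency dictionaries (a hypothetical representation chosen to keep the example self-contained). It uses recursion, so it is intended for small trees.

```python
def bipartitions(adj):
    """Return the set of non-trivial bipartitions of an unrooted tree.

    Each bipartition is stored as the frozenset of leaves on the side
    that does not contain the reference leaf (the smallest leaf name).
    """
    leaves = frozenset(n for n, nbrs in adj.items() if len(nbrs) == 1)
    ref = min(leaves)
    splits = set()

    def leaves_below(node, parent):
        # A degree-1 node reached from a parent is a leaf other than ref.
        if len(adj[node]) == 1 and parent is not None:
            return frozenset([node])
        below = frozenset()
        for child in adj[node]:
            if child != parent:
                below |= leaves_below(child, node)
        # Keep only splits with at least two leaves on each side.
        if 1 < len(below) < len(leaves) - 1:
            splits.add(below)
        return below

    leaves_below(ref, None)
    return splits


def rf_distance(adj1, adj2):
    """Number of non-trivial bipartitions present in exactly one tree."""
    return len(bipartitions(adj1) ^ bipartitions(adj2))


# Illustrative trees only: ((a,b),c,(d,e)) and ((a,c),b,(d,e)).
tree1 = {
    "a": ["u"], "b": ["u"], "c": ["v"], "d": ["w"], "e": ["w"],
    "u": ["a", "b", "v"], "v": ["u", "c", "w"], "w": ["v", "d", "e"],
}
tree2 = {
    "a": ["u"], "c": ["u"], "b": ["v"], "d": ["w"], "e": ["w"],
    "u": ["a", "c", "v"], "v": ["u", "b", "w"], "w": ["v", "d", "e"],
}
example_rf = rf_distance(tree1, tree2)
```

Here the two trees share the split separating d and e from the rest and differ in one split each, giving an RF distance of 2. In practice you would likely read the trees from Newick files with an established phylogenetics library rather than build adjacency dictionaries by hand.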