Homework for CS 581, Fall 2018
Homework policies
Due date:
All homework is due at 10 PM on the due date, via Moodle (unless
otherwise specified).
Late homeworks (up to 48 hours late) can be accepted for reduced
credit (see course webpage for details), except when otherwise
specified.
Collaboration policy:
You are expected to write up the homework yourself, but you are
welcome to discuss the homework with other students in the class.
If you discuss the homework with other students, clearly specify this on
your homework.
However, you can collaborate on HW 9, as long as you
are both fully responsible for the work.
Reading assignments:
Some homeworks involve homework problems from the textbook, and
many homework assignments involve
reading the textbook or published papers.
The class discussion depends on you doing the reading,
as I will not be teaching all the material.
Review questions:
The textbook has two types of questions: review questions and
homework problems.
In general, I will be assigning problems from the
homework problems and not from the review questions (although
do note that this is not true for some homework assignments).
We may discuss
the review questions in class, so
please look over the review questions as well.
Disputing a grade:
Please come see me directly if you have questions
or concerns about how your homework was graded.
Grading policy:
The homework overall contributes 40% of the course grade,
and each homework (except the last) contributes the same amount.
The worst grade from the first 8 assignments is dropped.
The first 8 in total contribute 28 points to your grade
and the last assignment is worth 12 points.
Substitute homeworks:
You may propose to write a paper review or to
do a dataset analysis instead of the assigned homework
for a given week (except the last assignment).
If you wish to do this, meet with me to select the appropriate
paper or dataset analysis.
Assignments
 Homework 1. Due Thursday, September 6 (10 PM).
 Read Chapters 13, 5.15.5.
 HW problems:
Chapter 1 problems 1, 5, 10, and 11.
Chapter 2 problems 1, 5, and 21.

Homework 2. Due Thursday, September 13 (10 PM).
 Read Chapters 4, 5.65.11, 8.18.2, 8.48.5, 8.8
 HW problems:
Chapter 3 problems 3 and 22.
Chapter 4 problems 1, 5, 6, 17, and 18.

Homework 3.
Due Thursday September 20 (10 PM).

Read: Chapter 8.6, 8.7, 8.98.13

HW problems:
Chapter 5 problems 9, 12, 15, 16, and 18.
Chapter 8 problems 8 and 9.

Homework 4. Due Thursday, September 27 (10 PM).

Homework 5.
Due Thursday, October 4 (10 PM).

Read Chapter 9.69.16 and Appendix C.
 HW problems: Chapter 9, problems 11 and 13.

Read the following papers.
Pick two of the papers and
write a brief summary (2 paragraphs, maximum),
and come to class ready to discuss the papers.
In particular, be prepared to pose two questions or critiques.
The two questions can either be requests for clarification,
a critique of the paper (of the methodology or conclusions), or
a suggestion for followup research.
Pay close attention to the methodology used to evaluate the methods that are presented.

"Who watches the watchmen?", by Iantorno et al. (2014), DOI 10.1007/9781627036467_4

"PhylogenyAware Gap Placement Prevents Errors in Sequence Alignment and Evolutionary Analysis", by Loytynoja and Goldman (2008), DOI: 10.1126/science.1158395

"Fast, scalable generation of highquality protein multiple sequence alignments using Clustal Omega" by Sievers et al. (2011), DOI: 10.1038/msb.2011.75

"MCoffee: combining multiple sequence alignment methods with TCoffee" (2006) by Wallace et al. DOI: 10.1093/nar/gkl091.

Hw 6. Due Thursday, October 11 (10 PM).

Read Chapters 6.16.2, 7.17.9, 10.110.4.

HW problems: Chapter 6 problems 4 and 8
 Write a 12 page proposal for your final project (see class webpage for guidelines)

HW 7. Due Thursday, October 18 (10 PM).
 Read Chapter 10.510.6.
 Let T be a binary tree and let D(x,y) denote
the number of internal nodes on the path between leaves x and y in T.
Prove that the matrix D is additive for T.
(Note  this is related to the internode distance matrix
used in the NJst and ASTRID species tree estimation methods.)
 Read and critique
12 recently published papers (i.e., since 2015) on
methods for
species tree estimation from multilocus datasets;
write this up in a 35 page document and submit it on Moodle.

HW 8. Due Thursday, October 25 (10 PM).

Read Chapters 11.111.8.
 HW problems: Chapter 7, problem 2. Chapter 10, problems 1 and 2.

Submit revised project proposal, and include
a list of 35 papers that interest you
on a subject relevant to this course and that
will be related to your course project.
These papers should either be recently published
or  if published earlier  be important papers for
the research question.

HW 9.
Due Thursday Nov 29 (10 PM).
This homework cannot be
dropped.
Do one of the following:
 Develop a clustering algorithm (see below for
guidelines), implement it, use it to analyze
one or more datasets (as described below), and write up a
report on the algorithm and what you observed on the data.
Please note that your code must be made available to the T.A.
for him to test it on additional datasets.
Hence, you are required to provide your report in MOODLE and
your commented code by email.

The input will be a set of 1000 to 10,000 unaligned DNA sequences,
drawn from one of the simulated datasets (ROSE, Indelible, or RNASim)
studied in the SATé and PASTA papers.
Please see https://sites.google.com/eng.ucsd.edu/datasets/pastaupp for RNASim and Indelible, and
https://sites.google.com/eng.ucsd.edu/datasets/satei for ROSE datasets.

The output will be a collection of clusters (disjoint or nondisjoint,
depending on the purpose of the clustering).
Each cluster should have at least 100 sequences and at most
10% of the input sequences.
 Constraints:
You are not allowed to compute a multiple sequence alignment
on the full dataset (although you are allowed to compute
alignments on small subsets, if you wish). You should
not use PASTA, SATé, or any other existing divideandconquer
approach in your algorithm.

You should run your code
on one replicate
of the ROSE 1000M1 datasets (from the SATé paper) and make
sure it runs on your own
laptop or in some other low memory environment.
You may also want to run your code on one replicate of the
Indelible 10,000M1 datasets (from the PASTA paper).
Report the running time, memory usage, and return the output of your code
(clustering of the dataset into subsets).

Note: there are two purposes for this clustering: multiple
sequence alignment (which needs disjoint clusters) and
tree estimation (which needs overlapping clusters).
So you might also want to follow through on your
algorithm design by seeing how well your clusters work for
a divideandconquer MSA estimation or a divideandconquer
tree estimation protocol.
But if you do this, it would be part of a final project,
not for this homework.

Take one published biological dataset where the published phylogeny estimated
on the dataset was produced using
outdated methods, and reanalyze using newer methods.
Compare the trees you obtain and discuss in a written paper.
(Note, your paper should be written as though you were going to submit
it for publication in a journal.)

Take an unpublished biological dataset and construct a phylogeny
on it using
at least two good protocols.
Compare the trees you obtain and discuss in a written paper.
(Note, your paper should be written as though you were going to submit
it for publication in a journal.)

Alternatively:
Develop a new method for one of the main problems discussed this
semester, implement it, test it, and write up your
findings.
Note that this variant for this homework is very ambitious and
is only recommended if you wish to make this part of your
final course project.