Homework for CS 581, Fall 2018

Homework policies

Due date: All homework is due at 10 PM on the due date, via Moodle (unless otherwise specified). Late homeworks (up to 48 hours late) can be accepted for reduced credit (see course webpage for details), except when otherwise specified.

Collaboration policy: You are expected to write up the homework yourself, but you are welcome to discuss the homework with other students in the class. If you discuss the homework with other students, clearly specify this on your homework. However, you can collaborate on HW 9, as long as you are both fully responsible for the work.

Reading assignments: Some homeworks involve homework problems from the textbook, and many homework assignments involve reading the textbook or published papers. The class discussion depends on you doing the reading, as I will not be teaching all the material.

Review questions: The textbook has two types of questions: review questions and homework problems. In general, I will be assigning problems from the homework problems and not from the review questions (although do note that this is not true for some homework assignments). We may discuss the review questions in class, so please look over the review questions as well.

Disputing a grade: Please come see me directly if you have questions or concerns about how your homework was graded.

Grading policy: The homework overall contributes 40% of the course grade, and each homework (except the last) contributes the same amount. The worst grade from the first 8 assignments is dropped. The first 8 in total contribute 28 points to your grade and the last assignment is worth 12 points.

Substitute homeworks: You may propose to write a paper review or to do a dataset analysis instead of the assigned homework for a given week (except the last assignment). If you wish to do this, meet with me to select the appropriate paper or dataset analysis.

Assignments

Homework 1. Due Thursday, September 6 (10 PM).
- Read Chapters 1-3, 5.1-5.5.
- HW problems: Chapter 1 problems 1, 5, 10, and 11. Chapter 2 problems 1, 5, and 21.
Homework 2. Due Thursday, September 13 (10 PM).
- Read Chapters 4, 5.6-5.11, 8.1-8.2, 8.4-8.5, 8.8
- HW problems: Chapter 3 problems 3 and 22. Chapter 4 problems 1, 5, 6, 17, and 18.
Homework 3. Due Thursday September 20 (10 PM).
- Read: Chapter 8.6, 8.7, 8.9-8.13
- HW problems: Chapter 5 problems 9, 12, 15, 16, and 18. Chapter 8 problems 8 and 9.
Homework 4. Due Thursday, September 27 (10 PM).
- Read Chapter 9.1-9.5
- Read Mona Singh's lecture notes on profile HMMs.
- HW problems: Chapter 8, problems 13, 14, 17, and 18. Chapter 9, problems 1, 3, and 4.
Homework 5. Due Thursday, October 4 (10 PM).
- Read Chapter 9.6-9.16 and Appendix C.
- HW problems: Chapter 9, problems 11 and 13.
- Read the following papers. Pick two of the papers and write a brief summary (2 paragraphs, maximum), and come to class ready to discuss the papers. In particular, be prepared to pose two questions or critiques. The two questions can either be requests for clarification, a critique of the paper (of the methodology or conclusions), or a suggestion for follow-up research. Pay close attention to the methodology used to evaluate the methods that are presented.
  - "Who watches the watchmen?", by Iantorno et al. (2014), DOI 10.1007/978-1-62703-646-7_4
  - "Phylogeny-Aware Gap Placement Prevents Errors in Sequence Alignment and Evolutionary Analysis", by Loytynoja and Goldman (2008), DOI: 10.1126/science.1158395
  - "Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega" by Sievers et al. (2011), DOI: 10.1038/msb.2011.75
  - "M-Coffee: combining multiple sequence alignment methods with T-Coffee" (2006) by Wallace et al. DOI: 10.1093/nar/gkl091.
Hw 6. Due Thursday, October 11 (10 PM).
- Read Chapters 6.1-6.2, 7.1-7.9, 10.1-10.4.
- HW problems: Chapter 6 problems 4 and 8
- Write a 1-2 page proposal for your final project (see class webpage for guidelines)
HW 7. Due Thursday, October 18 (10 PM).
- Read Chapter 10.5-10.6.
- Let T be a binary tree and let D(x,y) denote the number of internal nodes on the path between leaves x and y in T. Prove that the matrix D is additive for T. (Note - this is related to the internode distance matrix used in the NJst and ASTRID species tree estimation methods.)
- Read and critique 1-2 recently published papers (i.e., since 2015) on methods for species tree estimation from multi-locus datasets; write this up in a 3-5 page document and submit it on Moodle.
HW 8. Due Thursday, October 25 (10 PM).
- Read Chapters 11.1-11.8.
- HW problems: Chapter 7, problem 2. Chapter 10, problems 1 and 2.
- Submit revised project proposal, and include a list of 3-5 papers that interest you on a subject relevant to this course and that will be related to your course project. These papers should either be recently published or - if published earlier - be important papers for the research question.
HW 9. Due Thursday Nov 29 (10 PM). This homework cannot be dropped. Do one of the following:
- Develop a clustering algorithm (see below for guidelines), implement it, use it to analyze one or more datasets (as described below), and write up a report on the algorithm and what you observed on the data. Please note that your code must be made available to the T.A. for him to test it on additional datasets. Hence, you are required to provide your report in MOODLE and your commented code by email.
  - The input will be a set of 1000 to 10,000 unaligned DNA sequences, drawn from one of the simulated datasets (ROSE, Indelible, or RNASim) studied in the SATé and PASTA papers. Please see https://sites.google.com/eng.ucsd.edu/datasets/pastaupp for RNASim and Indelible, and https://sites.google.com/eng.ucsd.edu/datasets/sate-i for ROSE datasets.
  - The output will be a collection of clusters (disjoint or non-disjoint, depending on the purpose of the clustering). Each cluster should have at least 100 sequences and at most 10% of the input sequences.
  - Constraints: You are not allowed to compute a multiple sequence alignment on the full dataset (although you are allowed to compute alignments on small subsets, if you wish). You should not use PASTA, SATé, or any other existing divide-and-conquer approach in your algorithm.
  - You should run your code on one replicate of the ROSE 1000M1 datasets (from the SATé paper) and make sure it runs on your own laptop or in some other low memory environment. You may also want to run your code on one replicate of the Indelible 10,000M1 datasets (from the PASTA paper). Report the running time, memory usage, and return the output of your code (clustering of the dataset into subsets).
  - Note: there are two purposes for this clustering: multiple sequence alignment (which needs disjoint clusters) and tree estimation (which needs overlapping clusters). So you might also want to follow through on your algorithm design by seeing how well your clusters work for a divide-and-conquer MSA estimation or a divide-and-conquer tree estimation protocol. But if you do this, it would be part of a final project, not for this homework.
- Take one published biological dataset where the published phylogeny estimated on the dataset was produced using outdated methods, and re-analyze using newer methods. Compare the trees you obtain and discuss in a written paper. (Note, your paper should be written as though you were going to submit it for publication in a journal.)
- Take an unpublished biological dataset and construct a phylogeny on it using at least two good protocols. Compare the trees you obtain and discuss in a written paper. (Note, your paper should be written as though you were going to submit it for publication in a journal.)
- Alternatively: Develop a new method for one of the main problems discussed this semester, implement it, test it, and write up your findings. Note that this variant for this homework is very ambitious and is only recommended if you wish to make this part of your final course project.