Homework for undergraduates taking CS 581

Assignments

Homework 1.
- Read Chapters 1-3, 5.1-5.5.
- All review questions in Chapters 1, 2, and 3.
- HW problems: Chapter 1, problems 1, 7, and 8. Chapter 2, problems 1, 5, and 21.
Homework 2.
- Read Chapters 4, 5.6-5.11, 8.1-8.2, 8.4-8.5, 8.8.
- All review questions in Chapters 4 and 5.
- HW problems: Chapter 3, problems 3 and 22. Chapter 4, problems 1 and 5.
Homework 3.
- Read Chapters 8.6, 8.7, 8.9-8.13.
- All review questions in Chapter 8.
- HW problems: Chapter 5, problems 9, 16, and 18. Chapter 8, problems 1, 2, 11, and 12.
Homework 4.
- Read Chapter 9.1-9.5.
- Read Mona Singh's lecture notes on profile HMMs.
- HW problems: Chapter 8, problems 1-12, 14, 15. Chapter 9, problems 1, 2, 11, and 12.
Homework 5.
- Read Chapter 9.6-9.16 and Appendix C.
- All review questions for Chapter 9.
- Read the following papers. Pick two of them, write a brief summary (two paragraphs maximum), and come to class ready to discuss them. In particular, be prepared to pose two questions or critiques. Each can be a request for clarification, a critique of the paper (of its methodology or conclusions), or a suggestion for follow-up research. Pay close attention to the methodology used to evaluate the methods that are presented.
  - "Who watches the watchmen?", by Iantorno et al. (2014), DOI: 10.1007/978-1-62703-646-7_4
  - "Phylogeny-Aware Gap Placement Prevents Errors in Sequence Alignment and Evolutionary Analysis", by Löytynoja and Goldman (2008), DOI: 10.1126/science.1158395
  - "Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega", by Sievers et al. (2011), DOI: 10.1038/msb.2011.75
  - "M-Coffee: combining multiple sequence alignment methods with T-Coffee", by Wallace et al. (2006), DOI: 10.1093/nar/gkl091
Homework 6.
- Read Chapters 6.1-6.2, 7.1-7.9, 10.1-10.4.
- HW problems: Chapter 6, problems 4 and 8.
Homework 7.
- All review questions for Chapters 6 and 7.
- Read Chapter 10.5-10.6.
- Let T be a binary tree and let D(x,y) denote the number of internal nodes on the path between leaves x and y in T. Prove that the matrix D is additive for T. (Note: this is related to the internode distance matrix used in the NJst and ASTRID species tree estimation methods.)
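The additivity claim in the last problem of Homework 7 can be sanity-checked numerically before attempting the proof. The sketch below is only an illustration, not a proof: it assumes a small five-leaf tree chosen for this example, and the helper names `internode_distances` and `satisfies_four_point` are hypothetical. It computes D by counting internal nodes on each leaf-to-leaf path and then tests the four-point condition, which every additive matrix satisfies.

```python
from itertools import combinations

# Hypothetical example tree (an assumption for illustration): leaves a-e,
# internal nodes u, v, w, stored as an undirected adjacency list.
TREE = {
    "a": ["u"], "b": ["u"], "c": ["v"], "d": ["w"], "e": ["w"],
    "u": ["a", "b", "v"],
    "v": ["u", "c", "w"],
    "w": ["v", "d", "e"],
}
LEAVES = ["a", "b", "c", "d", "e"]


def internode_distances(tree, leaves):
    """D[x][y] = number of internal nodes on the path between leaves x and y."""
    D = {}
    for x in leaves:
        # Depth-first traversal from x, recording the edge count to every node.
        edges = {x: 0}
        stack = [x]
        while stack:
            node = stack.pop()
            for neighbor in tree[node]:
                if neighbor not in edges:
                    edges[neighbor] = edges[node] + 1
                    stack.append(neighbor)
        # A path with k edges between two distinct leaves has k - 1 internal nodes.
        D[x] = {y: 0 if y == x else edges[y] - 1 for y in leaves}
    return D


def satisfies_four_point(D, leaves):
    """True if, for every quartet, the two largest pairwise sums are equal."""
    for a, b, c, d in combinations(leaves, 4):
        sums = sorted([D[a][b] + D[c][d], D[a][c] + D[b][d], D[a][d] + D[b][c]])
        if sums[1] != sums[2]:
            return False
    return True


D = internode_distances(TREE, LEAVES)
print(D["a"]["b"], D["a"]["d"])         # 1 3
print(satisfies_four_point(D, LEAVES))  # True
```

Passing the check on one tree says nothing about trees in general; the homework still asks for a proof that holds for every binary tree T.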
Homework 8.
- All review questions for Chapter 10.
- HW problems: Chapter 7, problem 2. Chapter 10, problems 1 and 2.
Homework 9. Do one of the following:
- Develop a clustering algorithm (see below for guidelines), implement it, use it to analyze one or more datasets (as described below), and write up a report on the algorithm and what you observed on the data. Your code must be made available to the T.A. so that it can be tested on additional datasets; submit your report on Moodle and your commented code by email.
  - The input will be a set of 1000 to 10,000 unaligned DNA sequences, drawn from one of the simulated datasets (ROSE, Indelible, or RNASim) studied in the SATé and PASTA papers. Please see https://sites.google.com/eng.ucsd.edu/datasets/pastaupp for RNASim and Indelible, and https://sites.google.com/eng.ucsd.edu/datasets/sate-i for ROSE datasets.
  - The output will be a collection of clusters (disjoint or non-disjoint, depending on the purpose of the clustering). Each cluster should have at least 100 sequences and at most 10% of the input sequences.
  - Constraints: You are not allowed to compute a multiple sequence alignment on the full dataset (although you are allowed to compute alignments on small subsets, if you wish). You should not use PASTA, SATé, or any other existing divide-and-conquer approach in your algorithm.
  - You should run your code on one replicate of the ROSE 1000M1 datasets (from the SATé paper) and make sure it runs on your own laptop or in some other low-memory environment. You may also want to run your code on one replicate of the Indelible 10,000M1 datasets (from the PASTA paper). Report the running time and memory usage, and submit the output of your code (the clustering of the dataset into subsets).
  - Note: there are two purposes for this clustering: multiple sequence alignment (which needs disjoint clusters) and tree estimation (which needs overlapping clusters). So you might also want to follow through on your algorithm design by seeing how well your clusters work for a divide-and-conquer MSA estimation or a divide-and-conquer tree estimation protocol.
- Take one published biological dataset where the published phylogeny estimated on the dataset was produced using outdated methods, and re-analyze it using newer methods. Compare the trees you obtain and discuss them in a written paper. (Note: your paper should be written as though you were going to submit it for publication in a journal.)
- Take an unpublished biological dataset and construct a phylogeny on it using at least two good protocols. Compare the trees you obtain and discuss them in a written paper. (Note: your paper should be written as though you were going to submit it for publication in a journal.)
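For the clustering option of Homework 9, the cluster size constraints and the running time and memory report can be checked mechanically. The following is a minimal sketch under stated assumptions: clusters are represented as lists of sequence identifiers, the helper names (`check_cluster_sizes`, `is_disjoint`, `measure`, `toy_clustering`) are hypothetical, and `tracemalloc` only tracks memory allocated by Python itself, so it would undercount any external tools the algorithm calls.

```python
import time
import tracemalloc


def check_cluster_sizes(clusters, n_sequences, min_size=100, max_percent=10):
    """Return a list of messages describing clusters that break the size bounds."""
    max_size = n_sequences * max_percent // 100
    problems = []
    for i, cluster in enumerate(clusters):
        size = len(set(cluster))
        if size < min_size:
            problems.append(f"cluster {i}: {size} sequences (fewer than {min_size})")
        if size > max_size:
            problems.append(f"cluster {i}: {size} sequences (more than {max_size})")
    return problems


def is_disjoint(clusters):
    """True if no sequence identifier appears in more than one cluster."""
    seen = set()
    for cluster in clusters:
        members = set(cluster)
        if seen & members:
            return False
        seen |= members
    return True


def measure(func, *args):
    """Run func(*args); return (result, seconds, peak bytes of Python allocations)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = func(*args)
    seconds = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, seconds, peak


def toy_clustering(ids):
    """Placeholder for a real algorithm: consecutive chunks of 100 identifiers."""
    return [ids[i:i + 100] for i in range(0, len(ids), 100)]


ids = [f"seq{i}" for i in range(1000)]
clusters, seconds, peak_bytes = measure(toy_clustering, ids)
print(check_cluster_sizes(clusters, len(ids)))  # []
print(is_disjoint(clusters))                    # True
print(f"{seconds:.3f} s, peak {peak_bytes / 1e6:.1f} MB")
```

Note that with 1000 input sequences the two bounds coincide: at least 100 sequences and at most 10% of the input means every cluster must contain exactly 100 sequences.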