Homework for CS 581, Fall 2023
Homework policies
Due date:
All homework is due at 10 PM on the due date, via Moodle (unless
otherwise specified).
Late homeworks (up to 48 hours late) can be accepted for reduced
credit:
80% if within 24 hours and 60% if within 48 hours.
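As a rough illustration of the late policy above, the credit multiplier can be sketched as follows. The function name is hypothetical, and two details are assumptions rather than stated policy: exact boundaries (precisely 24 or 48 hours late) are treated as inclusive, and anything later than 48 hours is treated as earning no credit.

```python
from datetime import datetime, timedelta


def late_credit_multiplier(due: datetime, submitted: datetime) -> float:
    """Fraction of credit earned for a submission (hypothetical helper)."""
    lateness = submitted - due
    if lateness <= timedelta(0):
        return 1.0  # on time
    if lateness <= timedelta(hours=24):
        return 0.8  # within 24 hours of the deadline
    if lateness <= timedelta(hours=48):
        return 0.6  # within 48 hours of the deadline
    return 0.0  # assumption: later submissions are not accepted


# Example: due Friday, September 1 at 10 PM, submitted the next morning.
due = datetime(2023, 9, 1, 22, 0)
print(late_credit_multiplier(due, datetime(2023, 9, 2, 9, 30)))  # 0.8
```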
Form:
All homeworks should be typed (LaTeX is best, but you can also use Word
if you prefer) and then submitted as PDF.
No handwritten results are allowed, except if you are drawing something
and are including an image in the homework.
Collaboration policy:
You are expected to write up the homework yourself, but you are
encouraged to discuss the homework with other students in the class.
If you discuss the homework with other students, clearly specify on
your submitted homework who you worked with.
Reading assignments:
Some homeworks involve homework problems from the textbook, and
many homework assignments involve
reading the textbook or published papers.
The class discussion depends on you doing the reading,
as I will not be teaching all the material.
Homeworks connected to student presentations:
Note that starting sometime in October, each student will be presenting
one or more papers in the class.
That presentation, plus the discussion afterwards,
will use up the entire class.
Every student is expected to read the papers in advance,
and there will be a homework assignment (for each paper)
with submission date (in Moodle) the evening before the presentation.
That homework assignment includes writing at least 3 paragraphs
about the paper and providing questions
for the author (see Moodle for instructions).
The average of the homework grades, across these 7 assignments,
will count for 3 homework grades.
Review questions:
The textbook has two types of questions: review questions and
homework problems.
In general, I will be assigning problems from the
homework problems and not from the review questions (although
do note that this is not true for some homework assignments).
We may discuss
the review questions in class, so
please look over the review questions as well.
Disputing a grade:
Please come see me directly if you have questions
or concerns about how your homework was graded.
Grading policy:
The homework overall contributes 40% of the course grade,
and each homework (unless otherwise specified) contributes the same amount.
The two lowest homework grades are dropped.
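A minimal sketch of how the homework portion of the course grade could be computed under this policy is given below. It is only an illustration: the function name is hypothetical, scores are assumed to be on a 0-100 scale, and how multi-count assignments (such as Homework 6, or the presentation-related average) interact with dropping the lowest grades is not specified here, so the sketch simply takes one entry per grade slot.

```python
def homework_component(scores, n_dropped=2, weight=0.40):
    """Course points (out of 100) earned from homework (hypothetical helper).

    scores: one entry per homework grade slot, each on a 0-100 scale.
    The n_dropped lowest entries are removed and the rest weighted equally.
    """
    if len(scores) <= n_dropped:
        raise ValueError("need more scores than the number dropped")
    kept = sorted(scores)[n_dropped:]  # discard the lowest n_dropped scores
    average = sum(kept) / len(kept)
    return weight * average


# Example: the 60 and the 70 are dropped, leaving an average of 90.
print(round(homework_component([90, 80, 100, 60, 70]), 2))  # 36.0
```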
Reading Assignments
The assignments from the textbook do not include the "Further Reading"
sections.
Also note that we will
be including assigned reading from the scientific literature,
with 1-2 papers a week assigned after we finish going through the
textbook assignments.
In general, reading about 10-20 pages each week is to be expected.
The due dates for the reading assignments are the night before the class meeting,
which means you should come to class having already done the assigned reading.
- August 28: Chapter 1 (PDF) and Chapters 2.1-2.2, 3.1-3.4.2
- August 30: Chapters 2.3-2.10, 3.4.3-3.5, 5.1-5.5.1, 6.1-6.2
- September 4: Chapter 4.1-4.7, 8.1-8.8
- September 6: Chapters 5.6-5.7, 5.10, 5.11, 8.13
- September 11: Chapter 10.1-10.5,
and Zaharias and Warnow 2022
- September 13: Chapter 10.6-10.9, 7.1-7.7.9, 11.8
- September 18: Chapter 9.1-9.5
- September 20: Chapter 9.6-9.12
- September 27: "Hobgoblin of Phylogenetics", Nature 1994
(HTML).
Also read the following papers, which address the impact of
taxon sampling. These papers are just a few of the papers
written in response to the paper by Hillis on the Hobgoblin of Phylogenetics.
- Junhyong Kim. "Large-scale phylogenies and measuring the performance of phylogenetic estimators", Systematic Biology 1998. DOI: 10.1080/106351598261021
- Ziheng Yang and Nick Goldman. "Are big trees indeed easy?" Trends in Ecology and Evolution 12.9 (1997): 357. DOI: 10.1016/s0169-5347(97)83196-5
- David Hillis. "Are big trees indeed easy? Reply from D.M. Hillis". Trends in Ecology and Evolution 12.9 (1997). DOI: 10.1016/s0169-5347(97)83198-9
- October 3:
- David Hillis, "Inferring Complex Phylogenies", Nature Vol. 383, pages 130-131, Sept 1996.
- October 16:
- Chapter 8.9 in the textbook
- Chapter 7.10
- P. Zaharias, M. Grosshauser, and T. Warnow (2021). Re-evaluating deep neural networks for phylogeny estimation: The issue of taxon sampling. Journal of Computational Biology (2022), special issue for RECOMB 2021.
- October 23:
- Jiang, Yueyu, et al. "DEPP: deep learning enables extending species trees using single genes." Systematic Biology 72.1 (2023): 17-34.
- Sapoval, Nicolae, et al. "Current progress and open challenges for applying deep learning across the biosciences." Nature Communications 13.1 (2022): 1728.
- Azouri, Dana, et al. "Harnessing machine learning to guide phylogenetic-tree search algorithms." Nature Communications 12.1 (2021): 1983.
- Azouri, Dana, et al. "The tree reconstruction game: phylogenetic reconstruction using reinforcement learning." arXiv preprint arXiv:2303.06695 (2023).
- Smith and Hahn. "Phylogenetic inference using generative adversarial networks." Bioinformatics 39.9 (2023): btad543.
- Zou, Z., Zhang, H., Guan, Y., et al. "Deep residual neural networks resolve quartet molecular phylogenies." Mol. Biol. Evol. 37 (2020): 1495–1507.
- Wang, Z. et al. "Fusang: a framework for phylogenetic tree inference via deep learning." Nucleic Acids Research, gkad805, 2023. (HTML)
- October 28: Chapter 11
Homework Assignments
- Homework 0. Due Tuesday, August 29. This will be graded but
the grade will not count towards the course grade.
Note the reading assignment for Aug 28 covers Chapter 1
and Chapters 2.1-2.2, 3.1-3.4.2.
- All review questions for Chapter 1.
- Chapter 1, problems 1, 8, 10, and 11.
- Appendix B, problem 21
- Prove by induction on n (where n is a positive integer)
that every connected graph on n nodes
has at least n-1 edges.
- Homework 1. Due Friday, September 1, 10 PM.
Note the reading assignment for Aug 30 covers
Chapters 2.3-2.10, 3.4.3-3.5, 5.1-5.5.1, 6.1-6.2.
- Chapter 2: problems 29 and 32.
- Chapter 3: problems 5 and 22.
- Homework 2. Due Friday, September 8, 10 PM.
Note the reading assignment for
September 4: Chapter 4.1-4.7, 8.1-8.8 and
September 6: Chapters 5.6-5.7, 5.10, 5.11, 8.13.
- Chapter 4 problems 1, 5, 6, 17.
- Chapter 5 problems 7, 9, 12, 18.
- Chapter 8 problems 7-9.
- Homework 3. Due Friday, September 15, 10 PM.
Note the reading assignment for
September 11: Chapter 10.1-10.5,
Zaharias and Warnow 2022, and
September 13: Chapter 10.6-10.9, 7.1-7.7.9, 11.8.
- Chapter 6, problems 2, 4, 8.
- Chapter 7, review questions 1, 2, 4.
- Write a 1-2 page summary of the assigned paper (Zaharias and
Warnow).
Criticisms and discussion are valued in these summaries.
- Write a 1-2 page summary of another paper, selected from the
bibliography from Zaharias and Warnow.
Criticisms and discussion are valued in these summaries.
- Homework 4. Due Friday, September 22, 10 PM.
Note assigned reading of Chapter 9 from textbook.
- Chapter 10, problems 4-7, 10
- Homework 5. Due Wednesday, October 4.
- Read all assigned papers from Sept 27, and also read
the paper by David Hillis, "Inferring Complex Phylogenies",
Nature Vol. 383, pages 130-131, Sept 1996.
Write a paper
discussing all the papers, making sure to be
thoughtful about what the key points of the debate are.
Please provide answers to the following
questions:
- Do the different papers
address the same question or different questions?
If different questions, what questions are they?
- Do the authors agree with each other?
- Can you reconcile
the differences of opinion?
- Do you think that big trees are indeed easy? What is the evidence
for or against this statement?
- Finally, please examine the papers to see where the word "consistent" or "consistency"
comes up. Comment on the usage of the term in each case.
Make sure to write a scholarly paper with good grammar, citation practice,
paragraph structure, etc.
Your paper should be at least 3 pages long, including references.
- Homework 6. Due Wednesday, October 11, 10 PM (late
submissions are not accepted for this homework).
Note that this homework counts for two assignments (noted in Moodle as HW6 and HW6b).
Half of the points (one homework worth) covers the
figures/tables for all analyses you complete.
The other half of the points are for the write-up.
- In this week's homework, you will be doing analyses of datasets using
phylogeny estimation software, and you will write it all up as well.
The collaboration policy for this homework is you can discuss
with others, but you must write your own scripts, analyze the
data yourself, and write it all up yourself.
You will need to get help from
the TA, so please get ready.
Write up your results sufficiently for the experiment you did to be reproducible
(i.e., method version numbers and commands).
Comment on: what did you expect to see? What did you see?
What did you learn?
Include properly cited references for any software that you use.
- This is for gene tree estimation, using "true alignments".
Familiarize yourself with
the description of the
simulated datasets from the 1000M1 and 1000M4 model conditions
(see this webpage)
from the original
paper that introduced them (Liu et al., "Rapid and Accurate Large-Scale Coestimation of Sequence Alignments and Phylogenetic Trees," Science, vol. 324, no. 5934, pp. 1561-1564, 19 June 2009.).
The datasets themselves are at this page.
You will construct trees on true alignments for 5 replicates,
using 5 methods: FastTree (under GTR and also under JC) and Neighbor Joining
(using logdet, JC distances, and p-distances).
Note, you may also want to try some other methods, but first get
the 5 listed above.
For each tree you compute, compare it to the PIMT for
the dataset, and record the FN and FP error rates.
Do not be surprised if they are not the same, and do think about what this means.
Make a table or figure of average tree errors for each of the five methods for
the two models.
For each of the two model conditions (i.e., 1000M1 and 1000M4), comment on all trends you observe, including:
- Which method is the most accurate?
- Does changing the estimation model for FastTree impact accuracy?
If so, how?
- Does changing the distance correction for Neighbor Joining
impact accuracy? If so, how?
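For reference, two of the distance calculations mentioned above for Neighbor Joining can be sketched as follows. The p-distance is the fraction of compared sites at which two aligned sequences differ, and the Jukes-Cantor (JC) correction maps it to d = -(3/4) ln(1 - (4/3)p). This is only an illustration with hypothetical function names; skipping sites that contain gaps or ambiguity codes and returning infinity for saturated pairs are assumptions, and phylogenetics software may handle both cases differently.

```python
import math


def p_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of compared sites at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    # Assumption: only sites where both sequences have A, C, G, or T count.
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jc_distance(p: float) -> float:
    """Jukes-Cantor corrected distance for an observed p-distance."""
    if p >= 0.75:
        return math.inf  # assumption: the correction is undefined here
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


p = p_distance("ACGTACGT", "ACGTACGA")
print(p)                          # 0.125
print(round(jc_distance(p), 4))   # 0.1367
```

The corrected distance is always at least as large as the p-distance, because it accounts for multiple substitutions at the same site.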
Next, compare results under the model conditions 1000M1 and 1000M4:
- Do any of the trends above change as you change model conditions?
- Does the relative accuracy of methods remain the same between
the model conditions?
- One of the model conditions is "easier" (in that the methods have higher
accuracy); which one? And what is it about the model condition (i.e.,
numeric parameters and/or empirical statistics of the
sequences that are produced) that
suggests why that model condition is easier than the other?
Also comment on the following:
- Describe how the data were generated (this requires that you look at the paper from Science 2009); what sequence evolution model was used?
- Which of the methods that you ran is statistically consistent for the model that generated the data? Justify
your answers.
- Did the statistically consistent methods
always produce more accurate trees than the methods not
guaranteed to be consistent?
Note: your grade for this assignment will be based on
both content (75%) and writing (25%).
Reproducibility is part of content.
Please review the "Guidelines on writing assignments", on the course
webpage.
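To make the FN and FP error rates used in this assignment concrete, here is a minimal sketch on toy trees. It assumes the usual bipartition-based definitions: the FN rate is the fraction of the reference tree's internal branches missing from the estimated tree, and the FP rate is the fraction of the estimated tree's internal branches missing from the reference tree. Trees are written as nested tuples of leaf names rather than parsed from Newick files, and all function names are hypothetical.

```python
def leaf_set(tree):
    """All leaf names below a node (a leaf is a string, a node is a tuple)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Nontrivial bipartitions of the tree, viewed as unrooted.

    Each bipartition is stored as the side that excludes a fixed anchor leaf,
    so the same branch is represented identically in different trees.
    """
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - clade if anchor in clade else clade
        if 2 <= len(side) <= len(all_leaves) - 2:  # skip trivial splits
            found.add(side)
        return clade

    visit(tree)
    return found


def fn_fp_rates(reference, estimated):
    """Return (FN rate, FP rate) of the estimated tree against the reference."""
    ref, est = bipartitions(reference), bipartitions(estimated)
    fn = len(ref - est) / len(ref) if ref else 0.0
    fp = len(est - ref) / len(est) if est else 0.0
    return fn, fp


reference = ((("A", "B"), "C"), ("D", ("E", "F")))
estimated = ((("A", "C"), "B"), ("D", ("E", "F")))
fn, fp = fn_fp_rates(reference, estimated)
print(round(fn, 3), round(fp, 3))  # 0.333 0.333
```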
- Homework 7. Due Wednesday, October 18.
- Read the assigned reading for Oct 16.
- Answer Review Questions 3 and 5-7 in Chapter 7.
- Do problems 1-2 in Chapter 7.
- Homework. Due Saturday, October 21: Submit draft of course project
proposal (will not be graded)
- Homework 8. Due Monday, October 23:
See the assigned reading for October 23 (note that it has 7 papers
listed).
You will need to present one or two in the class.
Submit a paper with a brief paragraph about each of 6 of the papers, stating
whether you would like to present it (and why).
After I receive this homework, I will assign the papers to students for
class presentation.
- Homework. Due Wednesday, October 25:
Submit revision of course project
proposal (will not be graded)
- Homework 9. Due Monday, October 30.
Do one of the following (not more!).
Also, if you want to do the third problem, make
sure you start early (as you must do this on your own,
without assistance from the TA or from me).
Note also that writing will be part of the grading for the second and third problems.
- Chapter 11, Review questions and problem 10. Extra credit: do problem 2.
- Compare two alignment methods (Clustal-Omega and one other method of your choice) on 5 replicates
of the 1000M1 datasets for alignment and resultant tree accuracy (use the same
replicates you studied in Homework 6).
For alignment error/accuracy, report SPFN, SPFP, and TC, using FastSP for these calculations.
For tree error, report results obtained using FastTree2, and report FN/FP with respect
to the PIMT.
Discuss the trends you see. Which method has higher error?
Compare the results you obtain using these
estimated alignments to the results you obtained on the true alignment for Homework 6:
how does alignment error impact tree error?
I recommend you use as your second method something that has been used in the
MAGUS paper to align nucleotide sequences, and that is fast (note, among other things
this means you should not use BAli-Phy).
Make sure to provide sufficient
details (for reproducibility), and to use journal-quality writing.
You may benefit from reading the detailed description of how
analyses are run (including how FastSP is used to calculate
errors) in supplementary materials for alignment methods
(such as those in the MAGUS paper).
- Read these two papers: the uDance paper
and the FRACTAL paper.
Note what each paper aims to achieve (i.e., what assumptions about the input, model of
evolution, etc.).
How do these two papers compare, in terms of goals and achievements?
What are the underlying algorithmic techniques each uses?
Describe how each might be modified so that it could be used to estimate large trees of other types
(i.e., for other types of data).
Compare to other methods you are aware of for estimating large phylogenetic trees
(e.g., the GTM pipelines used for gene trees or species trees in
the papers you have read).
Your write-up should be something that you might submit for publication
as a review of both of these papers and also commentary about large
tree estimation, based on relevant related papers.
If you do this assignment, your write-up should be at least 3 pages, not including
the references, in 12 pt font, and you should make a point of including papers other than these in
your bibliography (and in your discussion).
- Homework 10. Due Wednesday Nov 1.
Write a paper, at least 1000 words (not including the
bibliography in this count), that summarizes
the presentation by Chengze Shen on Oct 26 and addresses the
following questions.
What does WITCH do?
What was the motivation for WITCH, and how does it compare to UPP in
terms of design and objectives?
What does EMMA do, and what was the motivation for EMMA?
What are the conditions where you would want to use one or the other
method, compared to using (say) some other MSA method?
Note: if you missed the presentation by Chengze Shen, you can answer these
questions by reading the WITCH and EMMA papers.
- Homework 11. Due November 5:
Submit final course project proposal (will be graded)