Genome 540
Introduction to Computational Molecular Biology:
Genome and Protein Sequence Analysis
(Winter Quarter 2024)
- Synopsis: Together with Genome 541, a two-quarter
introduction to protein and DNA
sequence analysis and molecular evolution, including probabilistic models of sequences and of sequence
evolution, computational gene identification, pairwise sequence
comparison and alignment (algorithms and statistical issues), multiple
sequence alignment and evolutionary tree construction, comparative
genomics, and protein sequence/structure relationships.
These represent the central computational approaches for
interpreting genomes. The
statistical and algorithmic ideas discussed (which include maximum
likelihood estimation, hidden Markov models, dynamic programming) have
wide applicability in other areas of computational & mathematical
biology as well.
- Prerequisites: Ability to write computer programs for data
analysis in a general purpose programming language such as C++ or Java (R and MATLAB are not suitable). Some
prior exposure to probability and statistics, and molecular biology, is
desirable but not essential.
- To register for credit, contact Phil Green (phg [ a t ] uw.edu) for permission (please include a brief summary of your programming experience, after reading 'Grading & homework policies' below). Enrollment is limited to 25 students. Auditing is allowed.
- Course organization: We follow the 'inverted lecture' model. Recorded lectures are to be viewed by students prior to the corresponding class discussion.
- Lectures: will be posted below as Zoom recordings & slide PDFs.
- Discussions (not recorded): TuTh 10:30-11:30 in Foege S-110. Discussion topics include the lecture, homework, and related material.
- The approximate syllabus is below; also see last year's site.
- Instructors:
- Phil Green (phg [ a t ] uw.edu)
- TA: Cliff Rostomily (crosto [ a t ] uw.edu)
- Office hours:
- Cliff: by appointment.
- Phil: by appointment.
- Grading & homework policies:
- The entire course grade is based on the homework assignments, which are
due weekly.
No tests or exams.
- The homework assignments involve writing programs for data analysis (using a general-purpose programming language -- R and MATLAB are not suitable for these assignments!), and running them on a computer you have access to (we do not provide computers). We don't require a specific language, since it is not
practical for us to grade your code, just the output from running your
programs. However it is important to use a language that allows you to
write programs that will run fast on large (genome-scale) datasets; ideally a compiled
language such as C, C++, or Java. You may be able to get by with an interpreted
language such as Python, however there is a
definite risk that programs in such languages will take very long to
run -- which means that you will need to get the program
written early enough that you can allow 2 or 3 days for the run and
still be able to turn in the assignment on time.
- Homework is due by 11:59 pm on the indicated date. Each assignment is worth 10 points, with a score of 7.0 equivalent to a 'B'. There is a late penalty of half a point for each day (or fraction thereof) past the deadline. You can redo the assignment and take the new score if you wish.
- Incorrectly formatted homework submissions will not be graded. They can be resubmitted, but a late penalty (as above) may apply.
- It is OK to run your program on someone else's input datafile, and
compare outputs to see if you get the same results. However it is not OK
to share programs, or to get someone else to debug your program. A key
part of the course is learning to write and debug your own programs for
data analysis.
- You can use 'generative AI' (such as ChatGPT, or GitHub Copilot) to help you with debugging and with translating from Python to C. However the program and output you submit must be your own (not generated by AI).
HOMEWORK ASSIGNMENTS:
LECTURE SLIDES & VIDEOS:
- Lecture 1 Video (100 min). Slides (34). Text.
Overviews of computational molecular biology, interpreting genomes, and this course.
- Lecture 2 Video (72 min). Slides (22).
Computation time issues; finding exact sequence matches using suffix arrays; hashtables.
- Discussion 2 Slides
- Lecture 3 Video (100 min). Slides (33).
Probability models. 'Background' sequence models; failure of equal frequency & independence assumptions. Neutralist v selectionist interpretations. Markov models. Assessing significance of sequence patterns; simulations.
- Discussion 3 Slides
- Lecture 4 Video (78 min). Slides (24).
Site models: Assumptions, construction, examples, failures.
- Discussion 4 Slides
- Lecture 5 Video (68 min). Slides (25).
Comparing probability models. Likelihood ratios and the Neyman-Pearson lemma. Weight matrices. Score distributions.
- Discussion 5 Slides
- Lecture 6 Video (76 min). Slides (29).
Average LRs and frequencies of site-like sequences in background. Relative entropies (average LLRs). Sequence logos.
- Discussion 6 Slides
- Lecture 7 Video (90 min). Slides (34).
Directed acyclic graphs. Dynamic programming.
- Discussion 7 Slides
- Lecture 8 Video (73 min). Slides (20).
Weighted linked lists as simple WDAGs. Sequence graphs & compositionally biassed regions. Defining and interpreting scores. Other applications.
- Discussion 8 Slides
- Lecture 9 Video (86 min). Slides (33).
Sequence alignment, evolution, & mutations. Edit graphs & Smith-Waterman alignment. Multiple sequence alignment: higher dimensional edit graphs, progressive alignment.
- Discussion 9 Slides
- Lecture 10 Video (79 min). Slides (30).
Local vs global alignments. Alignment scoring: background models for proteins, score matrices, profiles. Statistical significance.
- Discussion 10 Slides
- Lecture 11 Video (78 min). Slides (23).
Gap penalties. Word nucleation algorithms. Genome alignment.
- Discussion 11 Slides
- Lecture 12 Video (65 min). Slides (24).
More on WDAGs: inverted WDAGs, fwd/backwd algorithm, finding multiple high-scoring paths. Multiple paths in edit graphs: Finding repeats. Multiple paths in WLLs. D-segments.
- Discussion 12 Slides
- Lecture 13 Video (59 min). Slides (23).
Hidden Markov Models: Intro & definitions, examples.
- Discussion 13 Slides
- Lecture 14 Video (56 min). Slides (18).
HMM probability calculations: WDAG, Viterbi algorithm. 2-state HMMs & D-segments.
- Discussion 14 Slides
- Lecture 15 Video (72 min). Slides (25).
Forward & forward/backward algorithms. Parameter estimation: Viterbi & Baum-Welch training.
- Discussion 15 Slides
- Lecture 16 Video (64 min). Slides (19).
Evolutionary trees. Tree-based probabilities for aligned sequences.
- Discussion 16 Slides
- Lecture 17 Video (82 min). Slides (16).
Detecting sequence conservation with PhyloHMMs; PhastCons.
- Discussion 17 Slides
- Lecture 18 Video (83 min). Slides (23).
PhastCons.
- Discussion 18 Slides
- Lecture 19
(TBA: In-person lecture by Cliff).
- Lecture 20 Video (72 min). Slides (33).
Parsing genomes with HMMs.
- Discussion 20 Slides
COURSE-RELATED MATERIALS:
SUGGESTED TEXTS (not required):
- Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic
Acids
by Richard Durbin, S. Eddy, A. Krogh, G. Mitchison; Cambridge University
Press, 1998. ISBN: 0521629713. www.amazon.com.
- Statistical Methods in Bioinformatics : An Introduction (Statistics for Biology and Health)
by Warren J. Ewens, Gregory R. Grant; Springer, 2005. ISBN: 0387400826. Note that this is the 2d edition!!
www.amazon.com.
C/C++ PROGRAMMING GUIDES: