Genome 540
Introduction to Computational Molecular Biology:
Genome and Protein Sequence Analysis
(Winter Quarter 2022)
- Synopsis: Together with Genome 541, a two-quarter
introduction to protein and DNA
sequence analysis and molecular evolution, including probabilistic models of sequences and of sequence
evolution, computational gene identification, pairwise sequence
comparison and alignment (algorithms and statistical issues), multiple
sequence alignment and evolutionary tree construction, comparative
genomics, and protein sequence/structure relationships.
These represent the central computational approaches for
interpreting genomes. The
statistical and algorithmic ideas discussed (which include maximum
likelihood estimation, hidden Markov models, dynamic programming) have
wide applicability in other areas of computational & mathematical
biology as well.
- Prerequisites: Ability to write computer programs for data
analysis in a general purpose programming language such as C++ or Java (R and MATLAB are not suitable). Some
prior exposure to probability and statistics, and molecular biology, is
desirable but not essential.
- To register for credit, contact Phil Green (phg [ a t ] uw.edu) for permission (please include a brief summary of your programming experience). Enrollment is limited to 25 students. Auditing is allowed.
- Course organization: We follow the 'inverted lecture' model. Recorded lectures are to be viewed by students prior to the corresponding class discussion.
- Lectures: will be posted below as Zoom recordings & slide PDFs.
- Discussions: TuTh 10:30-11:30 in Foege S-110 (or via Zoom). Discussions (which are not recorded) may cover the lecture, homework, and related material.
- The approximate syllabus is below; also see last year's site.
- Instructors:
- Phil Green (phg [ a t ] uw.edu)
- TA: Chengxiang (CX) Qiu (cxqiu [ a t ] uw.edu)
- Office hours:
- CX: by appointment.
- Phil: by appointment.
- Grading & homework policies:
- The entire course grade is based on the homework assignments, which are
due weekly.
No tests or exams.
- The homework assignments involve writing programs for data analysis (using a general-purpose programming language -- R and MATLAB are not suitable for these assignments!), and running them on a computer you have access to (we do not provide computers). We don't require a specific language, since it is not
practical for us to grade your code, just the output from running your
programs. However it is important to use a language that allows you to
write programs that will run fast on large (genome-scale) datasets; ideally a compiled
language such as C, C++, or Java. You may be able to get by with an interpreted
language such as Python, however there is a
definite risk that programs in such languages will take very long to
run -- which means that you will need to get the program
written early enough that you can allow 2 or 3 days for the run and
still be able to turn in the assignment on time.
- Homework is due by 11:59 pm on the indicated date. Each assignment is worth 10 points. There is a late penalty of 1 point for each day (or fraction thereof) past the deadline, up to a maximum of 3 points. You can redo the assignment and take the new score if you wish. Assignments more than 1 week late will not be accepted unless you have obtained prior permission.
- Incorrectly formatted homework submissions will not be graded. They can be resubmitted, but a late penalty (as above) may apply.
- It is OK to run your program on someone else's input datafile, and
compare outputs to see if you get the same results. However it is not OK
to share programs, or to get someone else to debug your program. A key
part of the course is being able to write and debug your own programs for
data analysis.
HOMEWORK ASSIGNMENTS:
- Assignment 1, due Sunday Jan 16
- Assignment 2,, due Sunday Jan 23
- Assignment 3, due Sunday Jan 30
- Assignment 4, due Sunday Feb 6
- Assignment 5, due Sunday Feb 13
- Assignment 6, due Sunday Feb 20
- Assignment 7, due Sunday Feb 27
- Assignment 8, due Sunday Mar 6
- Assignment 9, due Sunday Mar 13
LECTURE SLIDES & VIDEOS:
- Lecture 1 Video (93 min). Slides (27).
Overviews of computational molecular biology, interpreting genomes, and this course.
- Lecture 2 Video (72 min). Slides (22).
Computation time issues; finding exact sequence matches using suffix arrays; hashtables.
- Week 1 discussion Slides
- Lecture 3 Video (100 min). Slides (33).
Probability models. 'Background' sequence models; failure of equal frequency & independence assumptions. Neutralist v selectionist interpretations. Markov models. Assessing significance of sequence patterns; simulations.
- Lecture 4 Video (78 min). Slides (24).
Site models: Assumptions, construction, examples, failures.
- Week 2 discussion Slides
- Lecture 5 Video (68 min). Slides (25).
Comparing probability models. Likelihood ratios and the Neyman-Pearson lemma. Weight matrices. Score distributions.
- Lecture 6 Video (76 min). Slides (29).
Average LRs and frequencies of site-like sequences in background. Relative entropies (average LLRs). Sequence logos.
- Week 3 discussion Slides
- Lecture 7 Video (90 min). Slides (34).
Directed acyclic graphs. Dynamic programming.
- Lecture 8 Video (73 min). Slides (20).
Weighted linked lists as simple WDAGs. Sequence graphs & compositionally biassed regions. Defining and interpreting scores. Other applications.
- Week 4 discussion Slides
- Lecture 9 Video (86 min). Slides (33).
Sequence alignment, evolution, & mutations. Edit graphs & Smith-Waterman alignment. Multiple sequence alignment: higher dimensional edit graphs, progressive alignment.
- Discussion Slides
- Lecture 10 Video (79 min). Slides (30).
Local vs global alignments. Alignment scoring: background models for proteins, score matrices, profiles. Statistical significance.
- Discussion Slides
- Lecture 11 Video (78 min). Slides (23).
Gap penalties. Word nucleation algorithms. Genome alignment.
- Discussion Slides
- Lecture 12 Video (65 min). Slides (24).
More on WDAGs: inverted WDAGs, fwd/backwd algorithm, finding multiple high-scoring paths. Multiple paths in edit graphs: Finding repeats. Multiple paths in WLLs. D-segments.
- Discussion Slides
- Lecture 13 Video (59 min). Slides (23).
Hidden Markov Models: Intro & definitions, examples.
- Discussion Slides
- Lecture 14 Video (56 min). Slides (18).
HMM probability calculations: WDAG, Viterbi algorithm. 2-state HMMs & D-segments.
- Discussion Slides
- Lecture 15 Video (72 min). Slides (25).
Forward & forward/backward algorithms. Parameter estimation: Viterbi & Baum-Welch training.
- Discussion Slides
- Lecture 16 Video (64 min). Slides (19).
Evolutionary trees. Tree-based probabilities for aligned sequences.
- Discussion Slides
- Lecture 17 Video (82 min). Slides (16).
Detecting sequence conservation with PhyloHMMs; PhastCons.
- Discussion Slides
- Lecture 18 Video (83 min). Slides (23).
PhastCons.
- Discussion Slides
- Lecture 19 In person lecture by CX.
- Lecture 20 Video (72 min). Slides (33).
Parsing genomes with HMMs.
- Discussion Slides
COURSE-RELATED MATERIALS:
SUGGESTED TEXTS (not required):
- Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic
Acids
by Richard Durbin, S. Eddy, A. Krogh, G. Mitchison; Cambridge University
Press, 1998. ISBN: 0521629713. www.amazon.com.
- Statistical Methods in Bioinformatics : An Introduction (Statistics for Biology and Health)
by Warren J. Ewens, Gregory R. Grant; Springer, 2005. ISBN: 0387400826. Note that this is the 2d edition!!
www.amazon.com.
C/C++ PROGRAMMING GUIDES: