Genome 540: Computational Molecular Biology

Genome 540

Introduction to Computational Molecular Biology:

Genome and Protein Sequence Analysis

(Winter Quarter 2024)

Synopsis: Together with Genome 541, a two-quarter introduction to protein and DNA sequence analysis and molecular evolution, including probabilistic models of sequences and of sequence evolution, computational gene identification, pairwise sequence comparison and alignment (algorithms and statistical issues), multiple sequence alignment and evolutionary tree construction, comparative genomics, and protein sequence/structure relationships. These represent the central computational approaches for interpreting genomes. The statistical and algorithmic ideas discussed (which include maximum likelihood estimation, hidden Markov models, dynamic programming) have wide applicability in other areas of computational & mathematical biology as well.
Prerequisites: Ability to write computer programs for data analysis in a general purpose programming language such as C++ or Java (R and MATLAB are not suitable). Some prior exposure to probability and statistics, and molecular biology, is desirable but not essential.
To register for credit, contact Phil Green (phg [ a t ] uw.edu) for permission (please include a brief summary of your programming experience, after reading 'Grading & homework policies' below). Enrollment is limited to 25 students. Auditing is allowed.
Course organization: We follow the 'inverted lecture' model. Recorded lectures are to be viewed by students prior to the corresponding class discussion.
- Lectures: will be posted below as Zoom recordings & slide PDFs.
- Discussions (not recorded): TuTh 10:30-11:30 in Foege S-110. Discussion topics include the lecture, homework, and related material.
- The approximate syllabus is below; also see last year's site.
Instructors:
- Phil Green (phg [ a t ] uw.edu)
- TA: Cliff Rostomily (crosto [ a t ] uw.edu)
Office hours:
- Cliff: by appointment.
- Phil: by appointment.
Grading & homework policies:
- The entire course grade is based on the homework assignments, which are due weekly. No tests or exams.
- The homework assignments involve writing programs for data analysis (using a general-purpose programming language -- R and MATLAB are not suitable for these assignments!), and running them on a computer you have access to (we do not provide computers). We don't require a specific language, since it is not practical for us to grade your code, just the output from running your programs. However it is important to use a language that allows you to write programs that will run fast on large (genome-scale) datasets; ideally a compiled language such as C, C++, or Java. You may be able to get by with an interpreted language such as Python, however there is a definite risk that programs in such languages will take very long to run -- which means that you will need to get the program written early enough that you can allow 2 or 3 days for the run and still be able to turn in the assignment on time.
- Homework is due by 11:59 pm on the indicated date. Each assignment is worth 10 points, with a score of 7.0 equivalent to a 'B'. There is a late penalty of half a point for each day (or fraction thereof) past the deadline. You can redo the assignment and take the new score if you wish.
- Incorrectly formatted homework submissions will not be graded. They can be resubmitted, but a late penalty (as above) may apply.
- It is OK to run your program on someone else's input datafile, and compare outputs to see if you get the same results. However it is not OK to share programs, or to get someone else to debug your program. A key part of the course is learning to write and debug your own programs for data analysis.
- You can use 'generative AI' (such as ChatGPT, or GitHub Copilot) to help you with debugging and with translating from Python to C. However the program and output you submit must be your own (not generated by AI).

HOMEWORK ASSIGNMENTS:

Assignment 1, due Sunday Jan 14
Assignment 2, due Sunday Jan 21
Assignment 3, due Sunday Jan 28
Assignment 4, due Sunday Feb 4
Assignment 5, due Sunday Feb 11
Assignment 6, due Sunday Feb 18
Assignment 7, due Sunday Feb 25
Assignment 8, due Sunday Mar 3
Assignment 9, due Sunday Mar 10

LECTURE SLIDES & VIDEOS:

Lecture 1 Video (100 min). Slides (34). Text. Overviews of computational molecular biology, interpreting genomes, and this course.
Lecture 2 Video (72 min). Slides (22). Computation time issues; finding exact sequence matches using suffix arrays; hashtables.
Discussion 2 Slides
Lecture 3 Video (100 min). Slides (33). Probability models. 'Background' sequence models; failure of equal frequency & independence assumptions. Neutralist v selectionist interpretations. Markov models. Assessing significance of sequence patterns; simulations.
Discussion 3 Slides
Lecture 4 Video (78 min). Slides (24). Site models: Assumptions, construction, examples, failures.
Discussion 4 Slides
Lecture 5 Video (68 min). Slides (25). Comparing probability models. Likelihood ratios and the Neyman-Pearson lemma. Weight matrices. Score distributions.
Discussion 5 Slides
Lecture 6 Video (76 min). Slides (29). Average LRs and frequencies of site-like sequences in background. Relative entropies (average LLRs). Sequence logos.
Discussion 6 Slides
Lecture 7 Video (90 min). Slides (34). Directed acyclic graphs. Dynamic programming.
Discussion 7 Slides
Lecture 8 Video (73 min). Slides (20). Weighted linked lists as simple WDAGs. Sequence graphs & compositionally biassed regions. Defining and interpreting scores. Other applications.
Discussion 8 Slides
Lecture 9 Video (86 min). Slides (33). Sequence alignment, evolution, & mutations. Edit graphs & Smith-Waterman alignment. Multiple sequence alignment: higher dimensional edit graphs, progressive alignment.
Discussion 9 Slides
Lecture 10 Video (79 min). Slides (30). Local vs global alignments. Alignment scoring: background models for proteins, score matrices, profiles. Statistical significance.
Discussion 10 Slides
Lecture 11 Video (78 min). Slides (23). Gap penalties. Word nucleation algorithms. Genome alignment.
Discussion 11 Slides
Lecture 12 Video (65 min). Slides (24). More on WDAGs: inverted WDAGs, fwd/backwd algorithm, finding multiple high-scoring paths. Multiple paths in edit graphs: Finding repeats. Multiple paths in WLLs. D-segments.
Discussion 12 Slides
Lecture 13 Video (59 min). Slides (23). Hidden Markov Models: Intro & definitions, examples.
Discussion 13 Slides
Lecture 14 Video (56 min). Slides (18). HMM probability calculations: WDAG, Viterbi algorithm. 2-state HMMs & D-segments.
Discussion 14 Slides
Lecture 15 Video (72 min). Slides (25). Forward & forward/backward algorithms. Parameter estimation: Viterbi & Baum-Welch training.
Discussion 15 Slides
Lecture 16 Video (64 min). Slides (19). Evolutionary trees. Tree-based probabilities for aligned sequences.
Discussion 16 Slides
Lecture 17 Video (82 min). Slides (16). Detecting sequence conservation with PhyloHMMs; PhastCons.
Discussion 17 Slides
Lecture 18 Video (83 min). Slides (23). PhastCons.
Discussion 18 Slides
Lecture 19 (TBA: In-person lecture by Cliff).
Lecture 20 Video (72 min). Slides (33). Parsing genomes with HMMs.
Discussion 20 Slides

COURSE-RELATED MATERIALS:

Math Notation
Biological Review Slides: Gene and genome structure in prokaryotes and eukaryotes; the genetic code & codon usage; "global" genome organization. Sources and characteristics of sequence data; Genbank and other sequence databases.
Nature paper on human genome sequence
Nature paper on mouse genome sequence
Siepel et al. paper on PhyloHMMs & sequence conservation
Rabiner tutorial on HMMs
HMM scaling tutorial (Tobias Mann)

SUGGESTED TEXTS (not required):

Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids by Richard Durbin, S. Eddy, A. Krogh, G. Mitchison; Cambridge University Press, 1998. ISBN: 0521629713. www.amazon.com.
Statistical Methods in Bioinformatics : An Introduction (Statistics for Biology and Health) by Warren J. Ewens, Gregory R. Grant; Springer, 2005. ISBN: 0387400826. Note that this is the 2d edition!! www.amazon.com.

Genome 540

Introduction to Computational Molecular Biology:

Genome and Protein Sequence Analysis

(Winter Quarter 2024)

HOMEWORK ASSIGNMENTS:

LECTURE SLIDES & VIDEOS:

COURSE-RELATED MATERIALS:

SUGGESTED TEXTS (not required):

C/C++ PROGRAMMING GUIDES:

UW Terms and Conditions of Use

UW Online Privacy Statement