Genome 540
Introduction to Computational Molecular Biology:
Genome and Protein Sequence Analysis
(Winter Quarter 2021)
- Synopsis: Together with Genome 541, a two-quarter
introduction to protein and DNA
sequence analysis and molecular evolution, including probabilistic models of sequences and of sequence
evolution, computational gene identification, pairwise sequence
comparison and alignment (algorithms and statistical issues), multiple
sequence alignment and evolutionary tree construction, comparative
genomics, and protein sequence/structure relationships.
These are the central
computational methods required to determine the "periodic table of
biology", i.e. the list of proteins and their evolutionary
relationships, which can be regarded as the first stage in the growth of
molecular biology into a quantitative science. Moreover, the
statistical and algorithmic methods used (which include maximum
likelihood estimation, hidden Markov models, dynamic programming) have
wide applicability in other areas of computational & mathematical
biology.
- Prerequisites: Ability to write computer programs for data
analysis in a general purpose programming language (R and MATLAB are not suitable). Some
prior exposure to probability and statistics, and molecular biology, is
desirable but not essential.
- Lectures: will be available as Zoom recordings one day prior to the corresponding discussion.
- Discussions: TuTh 10:30-11:50 via Zoom.
- (Zoom links will be emailed to registered attendees and auditors by the instructors.).
- To register for credit, contact the instructor (Phil Green, phg [ a t ] uw.edu) for permission, and Brian Giebel (bgiebel (at) uw.edu) to obtain an entry code. Enrollment is limited to 25 students. Auditing is allowed.
- Instructors:
- Phil Green (phg [ a t ] uw.edu)
- TA: Dani Faivre (dfaivre [ a t ] uw.edu)
- Office hours:
- Dani: by appointment.
- Phil: by appointment.
- Grading & homework policies:
- The entire course grade is based on the homework assignments, which are
due weekly.
No tests or exams.
- The homework assignments involve writing programs for data analysis (using a general-purpose programming language -- R and MATLAB are not suitable for these assignments!), and running them on a computer you have access to (we cannot provide computers). We don't require a specific language, since it is not
practical for us to grade your code, just the output from running your
programs. However it is important to use a language that allows you to
write programs that will run fast on large datasets; ideally a compiled
language such as C or C++. You may be able to get by with an interpreted
language such as Perl or Python (some people have), however there is a
definite risk that programs in these languages will take very long to
run -- which means that (i) you will need access to a very fast
computer to run them on, and/or (ii) you will need to get the program
written early enough that you can allow 2 or 3 days for the actual run and
still be able to turn in the assignment on time.
- Homework is due by 11:59 pm on the indicated date. After that it will be accepted, but penalized. Specifically, each assignment is worth 10 points, from which 1 point will be deducted for each day (or fraction thereof) that you turn it in late. The maximum deduction for being late is 3 points (even if you are more than 3 days late). If you get less than 7 points on an assignment, you are allowed to redo it and take the new score (which will be 7, i.e. 10 - 3, if there are no mistakes).
- It is OK to run your program on someone else's input datafile, and
compare outputs to see if you get the same results. However it is not OK
to share programs, or to get someone else to debug your program. A key
part of the course is being able to write and debug your own programs for
data analysis.
- Recommended texts:
- Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic
Acids
by Richard Durbin, S. Eddy, A. Krogh, G. Mitchison; Cambridge University
Press, 1998. ISBN: 0521629713. www.amazon.com.
- Statistical Methods in Bioinformatics : An Introduction (Statistics for Biology and Health)
by Warren J. Ewens, Gregory R. Grant; Springer, 2005. ISBN: 0387400826. Note that this is the 2d edition!!
www.amazon.com.
- See last year's site for approximate syllabus
HOMEWORK ASSIGNMENTS:
Assignment 1, due Sunday Jan 17
Assignment 2, due Sunday Jan 24
Assignment 3, due Sunday Jan 31
Assignment 4, due Sunday Feb 7
Assignment 5 , due Sunday Feb 14
Assignment 6 , due Sunday Feb 21
Assignment 7 , due Sunday Feb 28
Assignment 8 , due Sunday Mar 7
Assignment 9 , due Sunday Mar 14
LECTURE SLIDES & VIDEOS:
Math Notation
- Biological Review : Gene and genome structure in prokaryotes and eukaryotes; the genetic code & codon usage; "global" genome organization. Sources and characteristics of sequence data; Genbank and other sequence databases.
- Lecture 0 Video (22 min).
Slides (7). Course resources & requirements.
- Lecture 1.1 Video (65 min).
Slides (22). Overviews of computational molecular biology, genome biology & genomicists' tasks, and this course.
- Lecture 1.2 Video (55 min).
Slides (15). Algorithm generalities; finding exact matches using suffix arrays; hashtables.
- Lecture 2 Video (75 min).
Slides (21). Probability models for sequences. Failure of equal frequency & independence assumptions. Neutralist v selectionist interpretations. Reading: Durbin et al. section 1.3, 11.1; Ewens & Grant 1.1, 1.2, 1.12, 3.1, 3.2, 3.4, 3.6, 5.2, 9.1, 9.2
- Week 1 discussion Notebook. Slides.
- Lecture 3 Video (100 min).
Slides (37). Background and site models for DNA and protein sequences.
- Lecture 4 Video (68 min).
Slides (25). Comparing probability models. Likelihood ratios and the Neyman-Pearson lemma. Weight matrices. Score distributions.
- Week 2 discussion Notebook. Slides.
- Lecture 5 Video (76 min).
Slides (29). Average LRs and frequencies of site-like sequences in background. Relative entropies (average LLRs). Sequence logos.
- Lecture 6 Video (84 min).
Slides (36). Algorithmic complexity. Directed graphs, DAGs. Dynamic programming to find highest weight paths in WDAGs.
- Week 3 discussion Slides.
- Lecture 7 Video (94 min).
Slides (31). Weighted linked lists: Applications to sequences; finding multiple high-scoring paths; D-segments; statistical significance.
- Lecture 8 Video (87 min).
Slides (30). Sequence alignment, evolution, & mutations. Edit graphs & Smith-Waterman alignment. Local vs global.
- Week 4 discussion Slides.
- Lecture 9 Video (67 min).
Slides (27). Improved scoring: affine gap penalties, profiles. Statistical significance. Reducing time via word nucleation.
- Lecture 10 Video (64 min).
Slides (20). Reducing memory: linear space algorithms. Finding internal repeats. Genome alignment.
- Week 5 discussion Slides.
- Lecture 11 Video (80 min).
Slides (28). Alignment errors; gap attraction. Multiple sequence alignment: evolutionary trees, higher dimensional edit graphs, progressive alignment.
- Lecture 12 Video (59 min).
Slides (23). Hidden Markov Models: Intro & definitions, examples.
- Week 6 discussion Notebook. Slides.
- Lecture 13 Video (56 min).
Slides (18). HMM probability calculations: WDAG, Viterbi algorithm. 2-state HMMs & D-segments.
- Lecture 14 Video (72 min).
Slides (25). Forward & forward/backward algorithms. Parameter estimation: Viterbi & Baum-Welch training.
- Week 7 discussion Notebook. Slides.
- Lecture 15 Video (70 min).
Slides (20). Detecting sequence conservation with PhyloHMMs; PhastCons.
- Lecture 16 Video (105 min).
Slides (28). PhastCons.
- Week 8 discussion Slides.
- Lecture 17 Video (72 min).
Slides (33). Parsing genomes with HMMs.
- Week 9 discussion Slides.
COURSE-RELATED PAPERS:
C/C++ PROGRAMMING GUIDES: