Genome 540: Computational Molecular Biology

Genome 540

Introduction to Computational Molecular Biology:

Genome and Protein Sequence Analysis

(Winter Quarter 2006)

Synopsis: Together with Genome 541, a two-quarter introduction to protein and DNA sequence analysis and molecular evolution, including probabilistic models of sequences and of sequence evolution, computational gene identification, pairwise sequence comparison and alignment (algorithms and statistical issues), multiple sequence alignment and evolutionary tree construction, comparative genomics, and protein sequence/structure relationships. These are the central computational methods required to determine the "periodic table of biology", i.e. the list of proteins and their evolutionary relationships, which can be regarded as the first stage in the growth of molecular biology into a quantitative science. Moreover, the statistical and algorithmic methods used (which include maximum likelihood estimation, hidden Markov models, dynamic programming) have wide applicability in other areas of computational & mathematical biology.
Prerequisites: Ability to write computer programs for data analysis is essential (homework assignments will require this). Some prior exposure to probability and statistics, and to molecular biology, is highly desirable.
Lectures: TuTh 10:30-11:50 in J-280 (Health Sciences Complex). We start promptly at 10:30.
- Review/Homework discussion session: Tuesdays 12-1 in J-280 (for the week of Jan 9 ONLY, it will be Thursday 12-1 in J-280)
Instructors:
- Phil Green (K-343B; phg (at) u.washington.edu)
  - TA: Aaron Klammer (aklammer (at) u.washington.edu)
Office hours: Thursdays 12-1 (for Aaron), or by appointment (for either Aaron or Phil -- just ask!)
Text:
- (Required): Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids by Richard Durbin, S. Eddy, A. Krogh, G. Mitchison; Cambridge University Press, 1998. ISBN: 0521629713. Paperback, ~$35 (available from UW Bookstore--South Campus Center Branch, or www.bn.com, www.amazon.com).
- (Required): Statistical Methods in Bioinformatics: An Introduction (Statistics for Biology and Health) by Warren J. Ewens, Gregory R. Grant; Springer, 2005. ISBN: 0387400826. NOTE THAT THIS IS THE (NEW) 2ND EDITION OF THE TEXT USED IN PREVIOUS YEARS -- MAKE SURE YOU GET THIS EDITION!! Hardbound, ~$90 (available from UW Bookstore--South Campus Center Branch, or www.bn.com, www.amazon.com).
To register for credit, contact Brian Giebel (bgiebel (at) u.washington.edu) to obtain an entry code. Enrollment is limited to 25 students. Auditing is allowed.
See last year's site for an approximate syllabus.
Grading & homework policies:
- The entire course grade is based on the homework assignments, which are due weekly (more or less). No tests or exams.
- The homework assignments involve writing programs for data analysis and running them on a computer you have access to (we cannot provide computers). We don't require a specific language, since it is not practical for us to grade your code; we grade only the output from running your programs. However, it is important to use a language that allows you to write programs that will run fast on large datasets, ideally a compiled language such as C or C++. You may be able to get by with an interpreted or bytecode-compiled language such as Perl or Java (some people have), but there is a definite risk that programs in these languages will take very long to run. That means that (i) you will need access to a very fast computer to run them on, and/or (ii) you will need to get the program written early enough that you can allow 2 or 3 days for the actual run and still be able to turn in the assignment on time.
- Homework is due by 11:59 pm on the indicated date. After that it will be accepted, but penalized. Specifically, each assignment is worth 10 points, from which 1 point will be deducted for each day (or fraction thereof) that you turn it in late. The maximum deduction for being late is 6 points (even if you are more than 6 days late). If you get fewer than 4 points on an assignment, you are allowed to redo it and take the new score (which will be 4, i.e. 10 - 6, if there are no mistakes).
- It is OK to run your program on someone else's input datafile and compare outputs to see if you get the same results. However, it is not OK to share programs, or to get someone else to debug your program. A key part of the course is being able to write and debug your own programs for data analysis.
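The late-penalty rule in the homework policy above can be sketched as a short calculation. This is only an illustration of the stated rule, not an official grading script: the function names are hypothetical, and flooring the final score at 0 is an assumption (the policy does not say what happens when the deduction exceeds the points earned).

```python
import math

MAX_LATE_DEDUCTION = 6  # cap stated in the policy


def late_deduction(days_late: float) -> int:
    """Points deducted: 1 per day (or fraction of a day) late, capped at 6."""
    if days_late <= 0:
        return 0
    return min(MAX_LATE_DEDUCTION, math.ceil(days_late))


def final_score(points_earned: float, days_late: float) -> float:
    """Score after the late deduction (assumption: never below 0)."""
    return max(0, points_earned - late_deduction(days_late))


# A perfect assignment turned in 1.5 days late loses 2 points.
print(final_score(10, 1.5))
# A perfect redo turned in weeks late still earns 10 - 6 points.
print(final_score(10, 30))
```

Under this reading, an assignment turned in even a few minutes after 11:59 pm counts as one day late.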
Since important announcements (e.g. schedule changes) may be made by email in advance of lectures, all attendees (whether or not registered for the course) should send their email addresses to Phil Green at the above address.

HOMEWORK ASSIGNMENTS:

Assignment 1, due Saturday Jan. 14
Assignment 2, due Saturday Jan. 21
Assignment 3, due Saturday Jan. 28
Assignment 4, due Saturday Feb. 4
Assignment 5, due Saturday Feb. 11
Assignment 6, due Saturday Feb. 18
Assignment 7, due Wednesday Mar. 1
Assignment 8, due Sunday Mar. 12

SYLLABUS & LECTURE SLIDES:

Math Notation

Nature paper on Avida (Avida web site)

Nature paper on human genome sequence

Nature paper on mouse genome sequence

Rabiner tutorial on HMMs

HMM scaling tutorial (Tobias Mann)

Biological Review (1st discussion section): Gene and genome structure in prokaryotes and eukaryotes; the genetic code & codon usage; "global" genome organization. Sources and characteristics of sequence data; Genbank and other sequence databases.
Lecture 1: Living organisms as imperfect replication machines; theory of evolution & tree of life; 'artificial life'. Mutations as molecular basis for evolutionary change.
Lecture 2: Finding exact matches in sequences. CpG mutations/CpG islands. Large-scale mutational changes. Mutation fates. Neutral theory, mutation & substitution rates.
Lecture 3: Substitution rates (cont'd). Overview of goals & experimental approaches of molecular biology; role of sequence analysis. Generalities on algorithms for biological data; directed graphs; depth structure of directed acyclic graphs (DAGs); trees and linked lists.
Lecture 4: Dynamic programming on weighted DAGs. Maximal-scoring sequence segments; edit graphs & sequence alignment. Reading: Durbin et al. sections 2.1, 2.2, 2.3.
Lecture 5: Smith-Waterman algorithm, Needleman-Wunsch algorithm. Local vs. global. Profiles; edge weight issues; linear space algorithms. Reading: Durbin et al. 2.4, 2.5, 2.6.
Lecture 6: General & affine gap penalties; hidden state DAGs; Smith-Waterman special cases, self-similarity. Speedups based on nucleating word matches: BLAST, FASTA, cross_match.
Lecture 7: Multiple sequence alignment. Probability models on sequences; review of basic probability theory: probability spaces, conditional probabilities, independence; failure of equal frequency assumption for DNA and for proteins. Reading: Durbin et al. 6.1, 6.2, 6.3; Ewens & Grant 1.1, 1.2, 1.12.
Lecture 8: Site models; examples: 3' splice sites, 5' splice sites, protein motifs; limitations of site models (variable spacing, non-independence) -- splice site illustrations; site probability models. Comparing alternative models, hypothesis tests, likelihood ratio tests. Neyman-Pearson Lemma. Reading: Ewens & Grant 3.1, 3.2, 3.4, 3.6, 5.2, 9.1, 9.2
Lecture 9: Weight matrices for site models. Weight matrices for splice sites in C. elegans, score distributions. Hidden Markov Models: introduction; formal definition; probabilities of sequences. HMM examples: site models, 2-state models. Reading: Ewens & Grant 5.3.1, 5.3.2, 12.1, 12.2, 12.3; Durbin et al. chapter 3.
Lecture 10: HMM examples: 2-state models, 7-state prokaryote genome model. Computing HMM probabilities via associated WDAG. Reading: Ewens & Grant 12.2, 12.3; Durbin et al. chapter 3
Lecture 11: HMM Parameter estimation: Viterbi training, Baum-Welch (EM) algorithm; specialized techniques. Multiple alignment via profile HMMs. Information theory: entropy, coding theory/data compression, uniquely decodable codes. Reading: Ewens & Grant 1.14, Appendix B.10.
Lecture 12: Information theory: information inequality, Boltzmann distribution, Kraft inequality, entropy & expected code length. Information; relative entropy. Relative entropies of site models.
Lecture 13: Sequence logos. Random variables; exact probability distribution for weight matrix scores. Non-independence in background & compositional models. Reading: Ewens & Grant 1.3.1, 1.3.2, 1.3.4, 1.4, 1.5, 2.10.1, 4.5, 4.6, 5.2, 5.3.3.
Lecture 14: Probability models of biological sequences, allowing dependencies. Order k Markov models; minimum description length principle; overfitting. Sparse probabilistic suffix trees. Reading: Ewens & Grant 5.3.4
Lecture 15: Gene identification in eukaryotes.
Lecture 16: Gene identification in eukaryotes (cont'd).
Lecture 17: Maximal scoring segments; D-segments; exact probability dist'ns for segment scores. Karlin-Altschul theory for high-scoring segments. Reading: Ewens & Grant chapter 7
Lecture 18: Karlin-Altschul theory (cont'd).
Lecture 19: 'Non-random' mutations; transcription-coupled asymmetry. Reading: paper describing this work.
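As a concrete illustration of the local alignment material in Lectures 4-6, the following is a minimal sketch of the Smith-Waterman recurrence for the best local alignment score. It is not the course's reference implementation: the function name and the scoring values (match, mismatch, and a linear gap penalty) are assumptions chosen for the example, and it reports only the score, not the alignment itself.

```python
def smith_waterman_score(a: str, b: str,
                         match: int = 2,
                         mismatch: int = -1,
                         gap: int = -2) -> int:
    """Best local alignment score of a and b with a linear gap penalty.

    The scoring values are illustrative assumptions. Only the previous
    and current rows of the dynamic programming matrix are kept, which
    is enough when just the score is needed.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment here
                prev[j - 1] + pair,  # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


# The shared segment "ACG" scores 3 matches * 2 = 6.
print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

Dropping the 0 option in the recurrence and initializing the first row and column with cumulative gap penalties would give the global (Needleman-Wunsch) variant instead.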
