Genome 540/541: Computational Molecular Biology

Genome 540/541

Introduction to Computational Molecular Biology:

Genome and Protein Sequence Analysis

(Winter/Spring Quarter 2005)

Synopsis: A two-quarter introduction to protein and DNA sequence analysis and molecular evolution, including a brief review of basic molecular biology (structure & evolution of genes, genomes and proteins), probabilistic models of sequences and of sequence evolution, computational gene identification, pairwise sequence comparison and alignment (algorithms and statistical issues), multiple sequence alignment and evolutionary tree construction, comparative genomics, and protein sequence/structure relationships. These are the central computational methods required to determine the "periodic table of biology", i.e. the list of proteins and their evolutionary relationships, which can be regarded as the first stage in the growth of molecular biology into a quantitative science. Moreover, the statistical and algorithmic methods used (which include maximum likelihood estimation, hidden Markov models, dynamic programming) have wide applicability in other areas of computational & mathematical biology.
Prerequisites: Ability to write computer programs for data analysis is essential (homework assignments will require this.) Some prior exposure to probability and statistics, and molecular biology, is highly desirable but not essential.
Lectures: TuTh 10:30-11:50 in J-280 (Health Sciences Complex). We start promptly at 10:30.
- Review/Homework discussion session: Thurs 12-1 in I-132 (just off the Rotunda cafe), starting Jan 13.
Instructors:
- (Winter) Phil Green (K-343B; phg (at) u.washington.edu)
  - TA: Tobias Mann (mann (at) gs.washington.edu)
- (Spring) Joe Felsenstein (J-253; joe (at) gs.washington.edu)
- (Spring) Bill Noble (J-205; noble (at) gs.washington.edu)
Office hours: by appointment with the instructor (just ask!)
Text:
- (Required): Fundamental Concepts of Bioinformatics by Dan E. Krane, Michael L. Raymer; Benjamin Cummings 2002; ISBN: 0805346333. Paperback, ~$82 (available from UW Bookstore--South Campus Center Branch, or www.bn.com, www.amazon.com).
- (Required): Statistical Methods in Bioinformatics : An Introduction (Statistics for Biology and Health) by Warren J. Ewens, Gregory R. Grant; Springer, 2005. ISBN: 0387400826. NOTE THAT THIS IS THE (NEW) 2d EDITION OF THE TEXT USED IN PREVIOUS YEARS -- MAKE SURE YOU GET THIS EDITION!! Hardbound, ~$90 (available from UW Bookstore--South Campus Center Branch, or www.bn.com, www.amazon.com).
- (Recommended): Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids by Richard Durbin, S. Eddy, A. Krogh, G. Mitchison; Cambridge University Press, 1998. ISBN: 0521629713. Paperback, ~$35.
To register for credit, contact Brian Giebel (bgiebel (at) u.washington.edu) to obtain an entry code. Enrollment is limited to 25 students. Auditing is allowed. N.B. ALL 25 ENTRY CODES HAVE NOW BEEN GIVEN OUT -- YOU MUST EMAIL ME AT phg (at) u.washington.edu TO SEE IF ANY SPACE IS AVAILABLE.
See last year's site for approximate syllabus
Grading & homework policies:
- The entire course grade is based on the homework assignments, which are due weekly (more or less). No tests or exams.
- The homework assignments involve writing programs for data analysis, and running them on a computer you have access to (we cannot provide computers). We don't require a specific language, since it is not practical for me to grade your code, just the output from running your programs. However it is important to use a language that allows you to write programs that will run fast on large datasets; ideally a compiled language such as C or C++. You may be able to get by with an interpreted language such as Perl or Java (some people have), however there is a definite risk that programs in these languages will take very long to run -- which means that (i) you will need access to a very fast computer to run them on, and/or (ii) you will need to get the program written early enough that you can allow 2 or 3 days for the actual run and still be able to turn in the assignment on time.
- Homework is due before midnight on the indicated date. After that it will be accepted, but penalized.
- It is OK to run your program on someone else's input datafile, and compare outputs to see if you get the same results. However it is not OK to share programs, or to get someone else to debug your program. A key part of the course is being able to write and debug your own programs for data analysis.
Since important announcements (e.g. schedule changes) may be made by email in advance of lectures, all attendees (whether or not registered for the course) should send their email addresses to Phil Green at the above address.
Message Board Instructions for using the message board.

HOMEWORK ASSIGNMENTS (Winter Quarter):

Assignment 1, due Sunday Jan. 16
Assignment 2, due Sunday Jan. 23
Assignment 3, due Sunday Jan. 30
Assignment 4, due Sunday Feb. 6
Assignment 5, due Sunday Feb. 13
Assignment 6, due Sunday Feb. 20
Assignment 7, due Sunday Feb. 27
Assignment 8, due Sunday Mar. 6

SYLLABUS & LECTURE SLIDES FOR WINTER QUARTER:

Math Notation

Nature paper on Avida ( Avida web site )

Nature paper on human genome sequence

Nature paper on mouse genome sequence

Lecture 1: Living organisms as imperfect replication machines; theory of evolution & tree of life; 'artificial life'.
Lecture 2: Finding exact matches in sequences. Mutations as molecular basis for evolutionary change. CpG islands. Large-scale mutational changes. Reading: Krane & Raymer pp. 1-19, 57-65.
Lecture 3: Mutation fates. Neutral theory, mutation & substitution rates. Gene structure in prokaryotes. Reading: Krane & Raymer pp. 117-143.
Lecture 4: Gene and genome structure in prokaryotes and eukaryotes; the genetic code & codon usage; "global" genome organization. Reading: Krane & Raymer pp. 117-143.
Lecture 5: Sources and characteristics of sequence data; Genbank and other sequence databases. Reading: Krane & Raymer pp. 19-27 ("Molecular Biology Tools").
Lecture 6: Overview of goals & experimental approaches of molecular biology; role of sequence analysis. Generalities on algorithms for biological data; directed graphs; depth structure of directed acyclic graphs (DAGs); trees and linked lists. Dynamic programming on weighted DAGs.
Lecture 7: Dynamic programming on weighted DAGs (cont'd). Maximal-scoring sequence segments; edit graphs & sequence alignment (local & global): Smith-Waterman algorithm, Needleman-Wunsch algorithm. Reading: Krane & Raymer pp. 34-47.
Lecture 8: Local vs global. Profiles; edge weight issues; linear space algorithms; hidden state DAGs; general & affine gap penalties.
Lecture 9: Smith-Waterman special cases, self-similarity. Speedups based on nucleating word matches: BLAST, FASTA, cross_match. Multiple sequence alignment: biological considerations. Probability models on sequences; review of basic probability theory: probability spaces, conditional probabilities, independence; failure of equal frequency assumption for DNA. Reading: Krane & Raymer pp. 48-53. Ewens & Grant sections 1.1, 1.2, 1.12
Lecture 10: Failure of equal frequency assumption for proteins. Site models; examples: 3' splice sites, 5' splice sites, protein motifs; limitations of site models (variable spacing, non-independence) -- splice site illustrations; site probability models. Comparing alternative models, hypothesis tests, likelihood ratio tests. Neyman-Pearson Lemma. Reading: Ewens & Grant 3.1, 3.2, 3.4, 3.6, 9.1, 9.2
Lecture 11: Neyman-Pearson Lemma, approximation to significance level for likelihood ratio test. Weight matrices for site models. weight matrices for splice sites in C. elegans, score distributions. Hidden Markov Models: introduction; formal definition; probabilities of sequences. HMM examples: site models, 2-state models, 7-state prokaryote genome model. Reading: Ewens & Grant 5.2, 5.3.1, 5.3.2, 12.1
Lecture 12: HMM examples: 2-state models, 7-state prokaryote genome model. Computing HMM probabilities via associated WDAG. Parameter estimation: Viterbi training, Baum-Welch (EM) algorithm. Reading: Ewens & Grant 12.2, 12.3
Lecture 13: HMM parameter estimation: specialized techniques. Multiple alignment via profile HMMs. Information theory: entropy, information inequality, Boltzmann distribution, coding theory/data compression, Kraft inequality, entropy & expected code length, uniquely decodable codes; Reading: Ewens & Grant 1.14, Appendix B.10.
Lecture 14: Sequence logos. Information; relative entropy. Relative entropies of site models. Random variables; exact probability distribution for weight matrix scores. Non-independence in background & compositional models. Reading: Ewens & Grant 1.3.1, 1.3.2, 1.3.4, 1.4, 1.5, 2.10.1
Lecture 15: Non-independence in background & compositional models. Order k Markov models; minimum description length principle. Gene identification in eukaryotes. Reading: Ewens & Grant 4.5, 4.6, 5.2, 5.3.3, 5.3.4
Lecture 16: Gene identification in eukaryotes (cont'd).
Lecture 17: Gene identification in eukaryotes (cont'd).
Lecture 18: Gene identification in eukaryotes (cont'd). 'Non-random' mutations; transcription-coupled asymmetry. paper describing this work.
Lecture 19: Mutation asymmetry (cont'd). Karlin-Altschul theory for high-scoring segments. Reading: Ewens & Grant chapter 7
Lecture 20: Karlin-Altschul theory (cont'd).

HOMEWORK ASSIGNMENTS (Spring Quarter):

Homework 3, due Friday April 22. Tree file for Homework 3.
Homework 4, due Friday April 29. Data set for Homework 4.
Homework 5, due Friday May 6.
Homework 6, due Friday May 13.
Homework 7, due Friday May 20.
Homework 8, due Wednesday June 8.

SYLLABUS & LECTURE SLIDES FOR SPRING QUARTER:

Lectures 21-23 and HW assignments 1 & 2 (Bill Noble)
Lecture 24: Parsimony phylogeny methods. Fitch and Sankoff algorithms for counting steps on a given tree. Compatibility method and pairwise compatibility theorem. Reading: Krane & Raymer 78-86, 97-105; Ewens & Grant 497-499, 511-512; Durbin et al. 160-163, 173-176, 188-189
Lecture 25: Searching tree space. Counting trees. Heuristic rearrangement, sequential addition, star decomposition, disk-covering, and quartet puzzling methods. Branch and bound for finding all most parsimonious trees. Reading: Krane & Raymer 81-83, 105-108; Ewens & Grant 511-512; Durbin et al. 163-165, 176-179
Lecture 26: Distance methods. Least squares fitting of trees to matrices of distances. Statistical behavior. Neighbor-joining and UPGMA methods. Reading: Krane & Raymer 65-66, 85-93; Ewens & Grant 499-511; Durbin et al. 165-173, 189-191
Lecture 27: Models of DNA change. Markov chain models for evolutionary change of DNA sequences along lineages of a phylogeny. Jukes-Cantor, Kimura K2P, F84, HKY, Tamura-Nei, General Time-Reversible, General 12-parameter models. Reading: Krane & Raymer 67-68; Ewens & Grant 475-496; Durbin et al. 193-197
Lecture 28: Likelihood on phylogenies. Likelihoods for DNA models, algorithm for computing likelihoods on a tree. HMM rate variation again. Tree space with branch lengths. Likelihood ratio tests of molecular clocks, models of change. Reading: Krane & Raymer 93, 70-74; Ewens & Grant 512-516; Durbin et al. 197-209
Lecture 29: Modelling variation of rates of evolution from site to site. Gamma distribution of rates, Hidden Markov Model for rates and Forward-Backward algorithm to evaluate likelihood for it. Hasegawa 14-species 232-site primate mitochondrial sequences Reading: Krane & Raymer 70-74 (again); Ewens & Grant 409-416, 529; Durbin et al. 215-217
Lecture 30: Bootstraps and jackknifes for resampling among sites to get confidence intervals on the tree. What they do and don't mean. Reading: Krane & Raymer 108-11; Ewens & Grant 522-530; Durbin et al. 179-180, 212-215
Lecture 31: Coalescents: Trees of genes within species and what happens to molecular evolution as one gets down near the species level. Reading: Krane & Raymer 114; Durbin et al. 211-212
Lecture 32: Markov Chain Monte Carlo methods. Using them to compute likelihoods summing over all coalescent genealogies. Griffiths/Tavare' and Kuhner et. al. approaches. Reading: Ewens & Grant 392-398; Durbin et al. 206-207, 211-212
Lecture 33: Tree alignment. Why you ought not do multiple sequence alignment without thinking about phylogenies at the same time. Sankoff's strategy, Clustal and progressive alignment as approximations. Probabilistic models of insertion and deletion and prospects for ever getting anywhere using them. Reading: Krane & Raymer 93-94; Durbin et al. 180-188
Week 8: Lectures 34-35 (David Baker)
Week 9: Lecture 36 (Barbara Trask)
Week 9: Lecture 37 (TBA)
Week 10: Lectures 38-39 (Larry Ruzzo).

Genome 540/541

Introduction to Computational Molecular Biology:

Genome and Protein Sequence Analysis

(Winter/Spring Quarter 2005)

HOMEWORK ASSIGNMENTS (Winter Quarter):

SYLLABUS & LECTURE SLIDES FOR WINTER QUARTER:

HOMEWORK ASSIGNMENTS (Spring Quarter):

SYLLABUS & LECTURE SLIDES FOR SPRING QUARTER:

OTHER RELEVANT COURSES AT UW:

COMPUTATIONAL BIOLOGY COURSES AT OTHER SITES: