Genome 540/541
Introduction to Computational Molecular Biology:
Genome and Protein Sequence Analysis
(Winter/Spring Quarter 2004)
HOMEWORK ASSIGNMENTS (Winter Quarter):
SYLLABUS & LECTURE SLIDES FOR WINTER QUARTER:
Math Notation
Nature paper on Avida ( Avida web site )
Nature paper on human genome sequence
Nature paper on mouse genome sequence
-
Biology review
- Lecture 1: living organisms as imperfect replication machines; theory of evolution & tree of life; 'artificial life'; physical laws relevant to biological organisms; molecular structure
- Lecture 2: entropy and 2d law of thermodynamics, implications for living organisms; DNA, RNA & protein structure; gene expression & "Central Dogma"; the genetic code; codon usage. Reading: Krane & Raymer pp. 1-19. site with molecular biology graphics used in this lecture.
- Lecture 3: gene and genome structure in prokaryotes and eukaryotes; "global" genome organization. Reading: Krane & Raymer pp. 117-143.
- Lecture 4: GC content variation in mammalian genomes. Mutations as molecular basis for evolutionary change. CpG islands. Large-scale mutational changes. Mutation fates. Neutral theory, mutation & substitution rates. Reading: Krane & Raymer pp. 57-65.
- Lecture 5: Mutation & substitution rates (cont'd). Sources and characteristics of sequence data; Genbank and other sequence databases. Reading: Krane & Raymer pp. 19-27 ("Molecular Biology Tools").
-
Basic algorithms
- Lecture 6: overview of goals & experimental approaches of molecular biology; role of sequence analysis. Generalities on algorithms for biological data; finding exact matches in sequences; directed graphs; depth structure of directed acyclic graphs (DAGs); trees and linked lists.
- Lecture 7: Dynamic programming on weighted DAGs. Maximal-scoring sequence segments; edit graphs & sequence alignment (local & global): Smith-Waterman algorithm, Needleman-Wunsch algorithm. Reading: Krane & Raymer pp. 34-47.
- Lecture 8: Local vs global. Profiles; edge weight issues; linear space algorithms; hidden state DAGs; general & affine gap penalties; extending graphs to have "edge memory"; Smith-Waterman special cases, self-similarity.
- Lecture 9: Speedups based on nucleating word matches: BLAST, FASTA, cross_match. Multiple sequence alignment: biological considerations; product DAGs;projection-reduced complexity Reading: Krane & Raymer pp. 48-53.
-
Probability distributions on sets of sequences
- Lecture 10: Carillo-Lipman, Hein, & progressive alignment algorithms. Probability models on sequences; review of basic probability theory: probability spaces, conditional probabilities, independence; failure of equal frequency assumption for DNA, proteins. Reading: Ewens & Grant sections 1.1, 1.2, 1.12
- Lecture 11: Site models; examples: 3' splice sites, 5' splice sites, protein motifs; limitations of site models (variable spacing, non-independence) -- splice site illustrations; site probability models. Comparing alternative models, hypothesis tests, likelihood ratio tests. Neyman-Pearson Lemma, approximation to significance level for likelihood ratio test. Weight matrices for site models
Reading: Ewens & Grant 3.1, 3.2, 3.4, 8.1, 8.3
- Lecture 12: ; weight matrices for splice sites in C. elegans, score distributions. Hidden Markov Models: introduction; formal definition; probabilities of sequences. HMM examples: site models, 2-state models, 7-state prokaryote genome model. Reading: Ewens & Grant 5.2, 5.3.1, 5.3.2, 11.1
- Lecture 13: Computing HMM probabilities via associated WDAG. Parameter estimation: Viterbi training, Baum-Welch (EM) algorithm. Reading: Ewens & Grant 11.2, 11.3
- Lecture 14: Information theory: entropy, information inequality, Boltzmann distribution, coding theory/data compression, Kraft inequality, entropy & expected code length, uniquely decodable codes; information; relative entropy. Relative entropies of site models. Reading: Ewens & Grant 1.14, Appendix B.10.
- Lecture 15: Sequence logos. Random variables; exact probability distribution for weight matrix scores. Non-independence in background & compositional models. Reading: Ewens & Grant 1.3.1, 1.3.2, 1.3.3, 1.4, 1.5, 2.10.1
- Lecture 16: Order k Markov models; minimum description length principle. Gene identification in eukaryotes.
- Lecture 17: Gene identification in eukaryotes (cont'd).
- Lecture 18: Gene identification in eukaryotes (cont'd). 'Non-random' mutations; transcription-coupled asymmetry. paper describing this work.
- Lecture 19: Mutation asymmetry (cont'd). Karlin-Altschul theory for high-scoring segments. Reading: Ewens & Grant chapter 7
- Lecture 20: Karlin-Altschul theory (cont'd).
HOMEWORK ASSIGNMENTS (Spring Quarter):
SYLLABUS & LECTURE SLIDES FOR SPRING QUARTER:
- Lecture 21: Parsimony phylogeny methods. Fitch and Sankoff algorithms for counting steps on a given tree. Compatibility method and pairwise compatibility theorem. Reading: Krane & Raymer 78-86, 97-105; Ewens & Grant 385-387, 398-400; Durbin et al. 160-163, 173-176, 188-189
- Lecture 22: Searching tree space. Counting trees. Heuristic rearrangement, sequential addition, star decomposition, disk-covering, and quartet puzzling methods. Branch and bound for finding all most parsimonious trees. Reading: Krane & Raymer 81-83, 105-108; Ewens & Grant 398-400; Durbin et al. 163-165, 176-179
- Lecture 23: Distance methods. Least squares fitting of trees to
matrices of distances. Statistical behavior. Neighbor-joining
and UPGMA methods. Reading: Krane & Raymer 65-66, 85-93; Ewens & Grant 387-398; Durbin et al. 165-173, 189-191
- Lecture 24: Models of DNA change. Markov chain models for evolutionary change of DNA sequences along lineages of a phylogeny. Jukes-Cantor, Kimura K2P, F84, HKY, Tamura-Nei, General Time-Reversible, General 12-parameter models. Reading: Krane & Raymer 67-68; Ewens & Grant 365-382; Durbin et al. 193-197
- Lecture 25: Likelihood on phylogenies. Likelihoods for DNA models,
algorithm for computing likelihoods on a tree. HMM rate
variation again. Tree space with branch lengths. Likelihood ratio tests of molecular clocks, models of change. Reading: Krane & Raymer 93, 70-74; Ewens & Grant 400-408; Durbin et al. 197-209
- Lecture 26: Modelling variation of rates of evolution from site to site. Gamma distribution of rates, Hidden Markov Model for rates and Forward-Backward algorithm to evaluate likelihood for it. Hasegawa 14-species
232-site primate mitochondrial sequences Reading: Krane & Raymer 70-74 (again); Ewens & Grant 415-416; Durbin et al. 215-217
- Lecture 27: Bootstraps and jackknifes for resampling among sites to
get confidence intervals on the tree. What they do and
don't mean. Reading: Krane & Raymer 108-11; Ewens & Grant 408-415; Durbin et al. 179-180, 212-215
- Lecture 28: Coalescents: Trees of genes within species and what
happens to molecular evolution as one gets down near
the species level. Reading: Krane & Raymer 114; Durbin et al. 211-212
- Lecture 29: Markov Chain Monte Carlo methods. Using them to compute
likelihoods summing over all coalescent genealogies.
Griffiths/Tavare' and Kuhner et. al. approaches. Reading: Ewens & Grant 310-314; Durbin et al. 206-207, 211-212
- Lecture 30: Tree alignment. Why you ought not do multiple sequence
alignment without thinking about phylogenies at the same
time. Sankoff's strategy, Clustal and progressive
alignment as approximations. Probabilistic models of
insertion and deletion and prospects for ever getting
anywhere using them. Reading: Krane & Raymer 93-94; Durbin et al. 180-188
- Lectures 31-32 (Martin Tompa)
- Lectures 33-34 (including homework assignment) and Related papers (Hong Qian)
- Lectures 35-38 (Bill Noble)
- Lectures 39-40 (David Baker)
OTHER RELEVANT COURSES AT UW:
COMPUTATIONAL BIOLOGY COURSES AT OTHER SITES: