Genome 540/541
Introduction to Computational Molecular Biology:
Genome and Protein Sequence Analysis
(Winter/Spring Quarter 2003)
HOMEWORK ASSIGNMENTS (Winter Quarter):
SYLLABUS & LECTURE SLIDES FOR WINTER QUARTER:
Math Notation
Nature paper on human genome sequence
Nature paper on mouse genome sequence

Biology review
 Lecture 1: living organisms as imperfect replication machines; theory of evolution & tree of life; molecular structure; entropy and 2d law of thermodynamics, implications for living organisms
 Lecture 2: DNA, RNA & protein structure; gene expression & "Central Dogma"; the genetic code. Reading: Krane & Raymer pp. 119. site with molecular biology graphics used in this lecture.
 Lecture 3: codon usage; gene and genome structure in prokaryotes and eukaryotes. Reading: Krane & Raymer pp. 117143.
 Lecture 4: genome structure in eukaryotes: "global" genome organization, GC content variation in mammalian genomes. Mutations as molecular basis for evolutionary change. CpG islands. Largescale mutational changes. Mutation fates. Reading: Krane & Raymer pp. 5765.
 Lecture 5: Neutral theory, mutation & substitution rates. overview of goals & experimental approaches of molecular biology; role of sequence analysis.
 Lecture 6: Sources and characteristics of sequence data; Genbank and other sequence databases. Reading: Krane & Raymer pp. 1927 ("Molecular Biology Tools").

Basic algorithms
 Lecture 7: generalities on algorithms for biological data; finding exact matches in sequences; directed graphs; depth structure of directed acyclic graphs (DAGs); trees and linked lists; dynamic programming on weighted DAGs
 Lecture 8: dynamic programming on weighted DAGs (cont'd); maximalscoring sequence segments; edit graphs & sequence alignment (local & global): SmithWaterman algorithm, NeedlemanWunsch algorithm; local vs global. Reading: Krane & Raymer pp. 3447.
 Lecture 9: profiles; edge weight issues; linear space algorithms; hidden state DAGs; general & affine gap penalties; extending graphs to have "edge memory"
 Lecture 10: SmithWaterman special cases, selfsimilarity; speedups based on nucleating word matches: BLAST, FASTA, cross_match. Multiple sequence alignment: biological considerations; product DAGs Reading: Krane & Raymer pp. 4853.
 Lecture 11: CarilloLipman, Hein, & progressive alignment algorithms as special cases of projectionreduced complexity.

Probability distributions on sets of sequences
 Lecture 12: Probability models on sequences; review of basic probability theory: probability spaces, conditional probabilities, independence; failure of equal frequency assumption for DNA, proteins. Site models; example: 3' splice sites. Reading: Ewens & Grant sections 1.1, 1.2, 1.12
 Lecture 13: Site models; examples: 5' splice sites, protein motifs; limitations of site models (variable spacing, nonindependence)  splice site illustrations; site probability models. Comparing alternative models, hypothesis tests, likelihood ratio tests. NeymanPearson Lemma, approximation to significance level for likelihood ratio test.
Reading: Ewens & Grant 3.1, 3.2, 3.4, 8.1, 8.3
 Lecture 14: weight matrices for site models; weight matrices for splice sites in C. elegans, score distributions. Hidden Markov Models: introduction; formal definition; probabilities of sequences. Reading: Ewens & Grant 5.2, 5.3.1, 5.3.2, 11.1
 Lecture 15: HMM examples: site models, 2state models, 7state prokaryote genome model. Computing HMM probabilities via associated WDAG. Parameter estimation: Viterbi training, BaumWelch (EM) algorithm. Reading: Ewens & Grant 11.2, 11.3
 Lecture 16 (Bill Noble): Motifs. Reading: Ewens & Grant 10.5
 Lecture 17 (Bill Noble): Motifs (cont'd).
 Lecture 18 (Bill Noble): Motifs (cont'd). References
 Lecture 19: Information theory: entropy, information inequality, Boltzmann distribution, coding theory/data compression, Kraft inequality, entropy & expected code length, uniquely decodable codes; information; relative entropy. Relative entropies of site models; sequence logos. Reading: Ewens & Grant 1.14, Appendix B.10.
 Lecture 20: Gene identification in eukaryotes.
HOMEWORK ASSIGNMENTS (Spring Quarter):
SYLLABUS & LECTURE SLIDES FOR SPRING QUARTER:
 Lecture 21 notes and slides : Microarrays. Reading: Krane & Raymer pp. 143147.
 Lecture 22 notes and slides : Microarrays.
 Lecture 23 notes and slides : Microarrays.
 Lecture 24: 'Nonrandom' mutations; transcriptioncoupled asymmetry. paper describing this work.
 Lecture 25: Maximal scoring segments in weighted linked lists.
 Lecture 26: KarlinAltschul theory for highscoring segments. Reading: Ewens & Grant chapter 7
 Lecture 27 slides
 Lecture 28 slides and references
 Lecture 29: Parsimony phylogeny methods. Fitch and Sankoff algorithms for counting steps on a given tree. Compatibility method and pairwise compatibility theorem. Reading: Krane & Raymer 7886, 97105; Ewens & Grant 385387, 398400; Durbin et al. 160163, 173176, 188189
 Lecture 30: Searching tree space. Counting trees. Heuristic rearrangement, sequential addition, star decomposition, diskcovering, and quartet puzzling methods. Branch and bound for finding all most parsimonious trees. Reading: Krane & Raymer 8183, 105108; Ewens & Grant 398400; Durbin et al. 163165, 176179
 Lecture 31: Distance methods. Least squares fitting of trees to
matrices of distances. Statistical behavior. Neighborjoining
and UPGMA methods. Reading: Krane & Raymer 6566, 8593; Ewens & Grant 387398; Durbin et al. 165173, 189191
 Lecture 32: Models of DNA change. Markov chain models for evolutionary change of DNA sequences along lineages of a phylogeny. JukesCantor, Kimura K2P, F84, HKY, TamuraNei, General TimeReversible, General 12parameter models. Reading: Krane & Raymer 6768; Ewens & Grant 365382; Durbin et al. 193197
 Lecture 33: Likelihood on phylogenies. Likelihoods for DNA models,
algorithm for computing likelihoods on a tree. HMM rate
variation again. Tree space with branch lengths. Likelihood ratio tests of molecular clocks, models of change. Reading: Krane & Raymer 93, 7074; Ewens & Grant 400408; Durbin et al. 197209
 Lecture 34: Modelling variation of rates of evolution from site to site. Gamma distribution of rates, Hidden Markov Model for rates and ForwardBackward algorithm to evaluate likelihood for it. Hasegawa 14species
232site primate mitochondrial sequences Reading: Krane & Raymer 7074 (again); Ewens & Grant 415416; Durbin et al. 215217
 Lecture 35: Bootstraps and jackknifes for resampling among sites to
get confidence intervals on the tree. What they do and
don't mean. Reading: Krane & Raymer 10811; Ewens & Grant 408415; Durbin et al. 179180, 212215
 Lecture 36: Tree testing. Likelihood ratio tests, Paired sites tests
including KHT test, ShimodairaHasegawa multipletree
extension of it. Reading: Krane & Raymer 111112; Ewens & Grant 416421 ?
 Lecture 37: Coalescents: Trees of genes within species and what
happens to molecular evolution as one gets down near
the species level. Reading: Krane & Raymer 114; Durbin et al. 211212
 Lecture 38: Markov Chain Monte Carlo methods. Using them to compute
likelihoods summing over all coalescent genealogies.
Griffiths/Tavare' and Kuhner et. al. approaches. Reading: Ewens & Grant 310314; Durbin et al. 206207, 211212
 Lecture 39: Tree alignment. Why you ought not do multiple sequence
alignment without thinking about phylogenies at the same
time. Sankoff's strategy, Clustal and progressive
alignment as approximations. Probabilistic models of
insertion and deletion and prospects for ever getting
anywhere using them. Reading: Krane & Raymer 9394; Durbin et al. 180188
 Lecture 40: Genomic models: Gene families. Reconciled trees. Computing minimum numbers of inversions between genomes. Signed and unsigned maps. Breakpoint approximations. Prospects for taking DNA change into account too. Reading: Ewens & Grant ; Durbin et al.
OTHER RELEVANT COURSES AT UW:
COMPUTATIONAL BIOLOGY COURSES AT OTHER SITES: