Genome 540
Introduction to Computational Molecular Biology:
 Genome and Protein Sequence Analysis  
 (Winter Quarter 2006)  
 
 HOMEWORK ASSIGNMENTS:  
 SYLLABUS & LECTURE SLIDES: 
Math Notation
  Nature paper on Avida   ( Avida web site )
   Nature paper on human genome sequence 
   Nature paper on mouse genome sequence 
  Rabiner tutorial on HMMs
  HMM scaling tutorial (Tobias Mann) 
- Biological Review (1st discussion section): Gene and genome structure in prokaryotes and eukaryotes; the genetic code & codon usage; "global" genome organization. Sources and characteristics of sequence data; Genbank and other sequence databases.
 - Lecture 1: Living organisms as imperfect replication machines; theory of evolution & tree of life; 'artificial life'. Mutations as molecular basis for evolutionary change. 
 -  Lecture 2: Finding exact matches in sequences. CpG mutations/CpG islands. Large-scale mutational changes. Mutation fates.  Neutral theory, mutation & substitution rates. 
 -  Lecture 3: Substitution rates (cont'd). Overview of goals & experimental approaches of molecular biology; role of sequence analysis. Generalities on algorithms for biological data; directed graphs; depth structure of directed acyclic graphs (DAGs); trees and linked lists.
 -  Lecture 4: Dynamic programming on weighted DAGs. Maximal-scoring sequence segments; edit graphs & sequence alignment. Reading: Durbin et al. section 2.1, 2.2, 2.3. 
 -  Lecture 5:  Smith-Waterman algorithm, Needleman-Wunsch algorithm. Local vs. global.  Profiles; edge weight issues; linear space algorithms. Reading: Durbin et al. 2.4, 2.5, 2.6. 
 -  Lecture 6: General & affine gap penalties; hidden state DAGs; Smith-Waterman special cases, self-similarity. 
Speedups based on nucleating word matches: BLAST, FASTA, cross_match. 
 -  Lecture 7: Multiple sequence alignment. Probability models on sequences; review of basic probability theory: probability spaces, conditional probabilities, independence; failure of equal frequency assumption for DNA. Failure of equal frequency assumption for proteins. Reading: Durbin et al. 6.1, 6.2, 6.3; Ewens & Grant 1.1, 1.2, 1.12 
 -  Lecture 8: Site models; examples: 3' splice sites, 5' splice sites, protein motifs; limitations of site models (variable spacing, non-independence) -- splice site illustrations; site probability models. Comparing alternative models, hypothesis tests, likelihood ratio tests. Neyman-Pearson Lemma. Reading: Ewens & Grant 3.1, 3.2, 3.4, 3.6, 5.2, 9.1, 9.2 
 -  Lecture 9: Weight matrices for site models. weight matrices for splice sites in C. elegans, score distributions. Hidden Markov Models: introduction; formal definition; probabilities of sequences. HMM examples: site models, 2-state models. Reading: Ewens & Grant 5.3.1, 5.3.2, 12.1, 12.2, 12.3;  Durbin et al. chapter 3 
 -  Lecture 10: HMM examples: 2-state models, 7-state prokaryote genome model. Computing HMM probabilities via associated WDAG. Reading: Ewens & Grant 12.2, 12.3; Durbin et al. chapter 3 
 -  Lecture 11: HMM Parameter estimation: Viterbi training, Baum-Welch (EM) algorithm; specialized techniques. Multiple alignment via profile HMMs. Information theory: entropy, coding theory/data compression, uniquely decodable codes.  Reading: Ewens & Grant 1.14, Appendix B.10. 
 -  Lecture 12: Information theory: information inequality, Boltzmann distribution, Kraft inequality, entropy & expected code length. Information; relative entropy. Relative entropies of site models.  
 -  Lecture 13: Sequence logos. Random variables; exact probability distribution for weight matrix scores. Non-independence in background & compositional models.  Reading: Ewens & Grant 1.3.1, 1.3.2, 1.3.4, 1.4, 1.5, 2.10.1, 4.5, 4.6, 5.2, 5.3.3.  
 -  Lecture 14: Probability models of biological sequences, allowing dependencies. Order k Markov models; minimum description length principle; overfitting. Sparse probabilistic suffix trees. Reading: Ewens & Grant 5.3.4 
 -  Lecture 15: Gene identification in eukaryotes.
 -  Lecture 16: Gene identification in eukaryotes (cont'd).
 -  Lecture 17: Maximal scoring segments; D-segments; exact probability dist'ns for segment scores. Karlin-Altschul theory for high-scoring segments. Reading: Ewens & Grant chapter 7 
 -  Lecture 18: Karlin-Altschul theory (cont'd).
 -  Lecture 19: 'Non-random' mutations; transcription-coupled asymmetry.  paper  describing this work. 
 
 OTHER RELEVANT COURSES AT UW: 
 COMPUTATIONAL BIOLOGY COURSES AT OTHER SITES: