Genome 540 Homework Assignment 8

Due Sunday Mar 6

Write a program that implements an HMM for predicting genes in the genome sequence of Salmonella typhimurium (strain LT2). Specifically:
- The HMM has 11 states, numbered 1 through 11 (not counting the "begin" state 0). Each state emits observed symbols which are trinucleotides (rather than nucleotides); this is in order to allow us to model start and stop codons and the synonymous codon bias in coding sequences. Think of the original genome sequence, of length n, as being a sequence of n-2 (overlapping) trinucleotides: for example if the original sequence is AACGCT..., then the corresponding trinucleotides are AAC,ACG,CGC,GCT, ...
  - State 1 corresponds to the first base of a start codon for a top-strand gene. It emits ATG, CTG, GTG, or TTG, but no other trinucleotides (i.e. the emission probability for every trinucleotide not of the form NTG is 0).
  - State 2 corresponds to the first base of an internal codon for a top-strand gene. It emits any codon EXCEPT stop codons (i.e. TAA, TGA, and TAG have emission probabilities 0).
  - State 3 corresponds to the second base of a start or internal codon for a top-strand gene. It can emit any trinucleotide (a trinucleotide emitted by state 3 corresponds to the 2^d and 3^d bases of one codon and the first base of the next codon).
  - State 4 corresponds to the third base of a start or internal codon for a top-strand gene. It can emit any trinucleotide.
  - State 5 corresponds to the first base of a stop codon. It emits only stop codons (i.e. any trinucleotide not of the form TAA, TGA, or TAG has emission probability 0).
  - State 6 is the intergenic state. It can emit any trinucleotide.
  - States 7 thru 11 are the bottom strand versions of states 1 thru 5 (so they emit complementary trinucleotides to those emitted by states 1 thru 5).
  - The allowed transitions are as follows (you should make sure you understand why!):
    - 1 transitions only to 3 (i.e. a_1j = 0 unless j = 3)
    - 2 transitions only to 3
    - 3 transitions only to 4
    - 4 transitions only to 2 or to 5
    - 5 transitions only to 6
    - 6 transitions only to 6, or to 1, or to 11
    - 0 transitions only to 6 (i.e. the first base of the genome sequence is assumed intergenic)
    - other transitions involving states 7 thru 11: you figure these out!
  - Initial emission probabilities for each state should be 1/m, where m is the number of allowed trinucleotides for that state
  - Initial transition probabilities:
    - a₄₂ = .99, a₄₅ = .01
    - a₆₆ = .8, a₆₁ = .1, a_6,11 = .1
    - other cases determined by symmetry
Use Viterbi training as described in class to find improved parameter estimates. Run the training for 5 iterations, where for each iteration you:
1. Use dynamic programming to find the highest probability underlying state sequence.
2. Using this state sequence, compute
  - The numbers of top strand and bottom strand genes.
  - New transition probabilities and new emission probabilities, to be used in the next iteration.

Your output should provide

the name and first line of the .fna file
For each of the 5 iterations, the numbers of top strand and bottom strand genes given by the Viterbi parse, and the new transition and emission probabilities. Give probabilities to 5 decimal places. List all of probabilities, even when they are 0.

You must turn in your results and your computer program, using the template (file format) described below. Please put everything into ONE plain text file - do not send an archive of files or a tar file, or a word processing document file. Compress it (using either Unix compress, or gzip -- if you don't have access to either of these programs let us know), and send it as an attachment to both Phil at phg@u.washington.edu, and Tobias at mann@gs.washington.edu. Here is the template. <gs540_hw assignment='8' name='student name' email='student email'> <results> <result type='first line' file='filename'> first line of the .fna file that you use for viterbi training. NOTE: please do not change the filename or the first line in any way. Please include all characters of the first line (including the '>' character). </result> <result type='viterbi iteration' iteration='<some number>'> <result type='gene histogram'> The gene histogram should give the number of top strand and bottom strand genes found at this iteration, and should be reported as below. top strand genes=<some number>,bottom strand genes=<some number> </result> <model type='hmm'> The model object should specify an HMM by giving state labels, initial state probabilities (= transition probabilities from the begin state), state transition probabilities, and symbol emission probabilities. <states> give your state labels, separated by commas: 1,2 </states> <initial_state_probabilities> initial state probabilities should give, for each state, the state label and probability of starting in that state (i.e. the probability of transitioning into that state from the begin state) separated by an equals sign. Entries should be separated by commas. 1=0.90000,2=0.10000 </initial_state_probabilities> <transition_probabilities state='1'> transition probabilities should give, for each state, the state label and probability of transitioning to that state from the state indicated in the attributes list. The present field (with state='1') gives the probabilities of transitioning from state 1 to states 1 and 2: 1=0.99000,2=0.01000 </transition_probabilities> <transition_probabilities state='2'> 1=0.20000,2=0.80000 </transition_probabilities> <emission_probabilities state='1'> For each symbol emitted by the state indicated in the attributes for this field, give the probability of emitting that symbol. TTT=.15000,TCT=.15000,TAT=.15000,TGT=.15000, ... </emission_probabilities> <emission_probabilities state='2'> TTT=.20000,TCT=.15000,TAT=.15000,TGT=.10000, ... </emission_probabilities> </model> </result> </results> <program> <comments> put comments about your code here </comments> <file> file contents here </file> </program> </gs540_hw>