Genome 540 Homework Assignment 9

Due Sunday Mar 14

Write a program that implements an HMM for predicting genes in the genome sequence you used in assignments 1-4. Specifically:
- The HMM has 11 states, numbered 1 through 11 (not counting the "begin" state 0). Each state emits observed symbols which are trinucleotides (rather than nucleotides); this is in order to allow us to model start and stop codons and the synonymous codon bias in coding sequences. Think of the original genome sequence, of length n, as being a sequence of n-2 (overlapping) trinucleotides: for example if the original sequence is AACGCT..., then the corresponding trinucleotides are AAC,ACG,CGC,GCT, ...
  - State 1 corresponds to the first base of a start codon for a top-strand gene. It emits ATG, CTG, GTG, or TTG, but no other trinucleotides (i.e. the emission probability for every trinucleotide not of the form NTG is 0).
  - State 2 corresponds to the first base of an internal codon for a top-strand gene. It emits any codon EXCEPT stop codons (i.e. TAA, TGA, and TAG have emission probabilities 0).
  - State 3 corresponds to the second base of a start or internal codon for a top-strand gene. It can emit any trinucleotide (a trinucleotide emitted by state 3 corresponds to the 2^d and 3^d bases of one codon and the first base of the next codon).
  - State 4 corresponds to the third base of a start or internal codon for a top-strand gene. It can emit any trinucleotide.
  - State 5 corresponds to the first base of a stop codon. It emits only stop codons (i.e. any trinucleotide not of the form TAA, TGA, or TAG has emission probability 0).
  - State 6 is the intergenic state. It can emit any trinucleotide.
  - States 7 thru 11 are the bottom strand versions of states 1 thru 5 (so they emit complementary trinucleotides to those emitted by states 1 thru 5).
  - The allowed transitions are as follows (you should make sure you understand why!):
    - 1 transitions only to 3 (i.e. a_1j = 0 unless j = 3)
    - 2 transitions only to 3
    - 3 transitions only to 4
    - 4 transitions only to 2 or to 5
    - 5 transitions only to 6
    - 6 transitions only to 6, or to 1, or to 11
    - 0 transitions only to 6 (i.e. the first base of the genome sequence is assumed intergenic)
    - other transitions involving states 7 thru 11: you figure these out!
  - Initial emission probabilities for each state should be 1/m, where m is the number of allowed trinucleotides for that state
  - Initial transition probabilities:
    - a₄₂ = .99, a₄₅ = .01
    - a₆₆ = .8, a₆₁ = .1, a_6,11 = .1
    - other cases determined by symmetry
Use Viterbi training as described in class to find improved parameter estimates. Run the training for 5 iterations, where for each iteration you:
1. Use dynamic programming to find the highest probability underlying state sequence.
2. Using this state sequence, compute
  - The numbers of top strand and bottom strand genes.
  - New transition probabilities and new emission probabilities, to be used in the next iteration.

Your output should provide

the name and first line of the .fna file
For each of the 5 iterations, the numbers of top strand and bottom strand genes given by the Viterbi parse, and the new transition and emission probabilities. Give probabilities to 3 decimal places only. In listing the transition probabilities, give only those a_ij which are non-zero. Format the emission probabilities for each state in the form of a codon table (as in HW 2), i.e. with the frequency of TTT in the upper left corner.

Email the output above to me and Chris. Please make it as compact as possible. Do NOT send the code itself. Include the output in the body of your email message (as plain text), NOT as an attachment.