Genome 540 Homework Assignment 7

Due Sunday March 1st, 11:59pm

Write a program that implements a 2-state HMM for detecting G+C rich regions in the Pyrococcus horikoshii (Genbank file, GFF file). Conceptually, state 1 will correspond to the more frequent 'A+T rich' state, whereas state 2 will correspond to the less frequent G+C-rich state. Specifically:
- The starting parameter values (adapted from Klein et al (2002), PNAS 99: 7542-47) should be as follows:
  - Transition probabilities a_ij are a_11 = .999, a_12 = .001, a_21 = .01, a_22 = .99.
  - Initiation probabilities for each state (i.e. the transition probabilities from the 'begin' state into state 1 or 2) should be .996 for state 1, and .004 for state 2; these should be held fixed throughout the Viterbi training
  - Emission probabilities are
    - e_A = e_T = .30, e_G = e_C = .20 for state 1;
    - e_A = e_T = .15, e_G = e_C = .35 for state 2.
- Use Viterbi training as described in class to find improved parameter estimates for the transition probabilities. The emission probabilities should be updated as described in lecture with the exception that you will calculate the counts for A and T together and G and C together such that the frequencies e_G=e_C and e_A=e_T. Do not update the initiation probabilities. Run the training for 10 iterations, where for each iteration you:
  1. Use dynamic programming to find the highest probability underlying state sequence.
  2. Using this state sequence, compute
    - The number of states of each of the two types (1 and 2), and the number of segments of each type (where a segment consists of a contiguous series of states of the same type, that is preceded and followed by states of the opposite type or the beginning or end of the sequence).
    - New transition probabilities to be used in the next iteration.
    - New emission probabilities to be used in the next iteration.
Your output should provide
- the name and first line of the .fna file
- the information described above (in 2. -- i.e. numbers of states and segments, and new probability values), for each of the 10 iterations. Give probabilities to 4 decimal places only.
- the list of G+C-rich segments (corresponding to the segments having state 2 as the underlying state) after the final (10th) round of Viterbi training, sorted by genomic position.
- your description of the first 5 of the segments found above, as found by looking up the Genbank annotations.
Your output should conform to the template specified below.
You must turn in your results and your computer program, using this example file. Note that unlike prior assignments, this example file is not a correct solution; it only provides an formatting example. It also contains only 2 iterations and 2 genbank annotations - your solution should contain 10 and 5, respectively. Please put everything into ONE plain text file - do not send an archive of files or a tar file, or a word processing document file. Compress it (using either Unix compress, or gzip -- if you don't have access to either of these programs let us know), and send it as an attachment to both Phil (phg (at) u.washington.edu) and Mitchell (mvollger (at) uw.edu).

Template Details:

Transition Probabilities: The probability of transitioning from state 1 to state 1, 1 to 2, etc.
1,1=0.9900
1,2=0.0100
2,1=0.2000
2,2=0.8000
Emission Probabilities: The probability of emitting A in state 1, C in state 1, G in state 1, etc.
1,A=0.2500
1,C=0.2500
1,G=0.2500
Segment List: A list of segments with state 2 - start position to end position.
5 7
11 11
15 17