Genome 540 Homework Assignment 5

Due Sunday Feb 16

Write a program that implements a 2-state HMM for detecting G+C rich regions in the Vibrio cholerae IEC224 chromosome II sequence (Genbank file). Conceptually, state 1 will correspond to the more frequent 'A+T rich' state, whereas state 2 will correspond to the less frequent G+C-rich state. Specifically:
- The starting parameter values (taken from Klein et al (2002), PNAS 99: 7542-47) should be as follows:
  - Transition probabilities a_ij are a_11 = .999, a_12 = .001, a_21 = .01, a_22 = .99.
  - Initiation probabilities for each state (i.e. the transition probabilities from the 'begin' state into state 1 or 2) should be .996 for state 1, and .004 for state 2; these should be held fixed throughout the Viterbi training
  - Emission probabilities (which should also be held fixed) are
    - e_A = e_T = .291, e_G = e_C = .209 for state 1;
    - e_A = e_T = .169, e_G = e_C = .331 for state 2.
- Use Viterbi training as described in class to find improved parameter estimates for the transition probabilities, holding the emission and initiation probabilities fixed at the above values. Run the training for 10 iterations, where for each iteration you:
  1. Use dynamic programming to find the highest probability underlying state sequence.
  2. Using this state sequence, compute
    - The number of states of each of the two types (1 and 2), and the number of segments of each type (where a segment consists of a contiguous series of states of the same type, that is preceded and followed by states of the opposite type or the beginning or end of the sequence).
    - New transition probabilities to be used in the next iteration.
Your output should provide
- the name and first line of the .fna file
- the information described above (in 2. -- i.e. numbers of states and segments, and new probability values), for each of the 10 iterations. Give probabilities to 4 decimal places only.
- the list of G+C-rich segments (corresponding to the segments having state 2 as the underlying state) after the final (10th) round of Viterbi training, sorted by genomic position.
- your description of the first 5 of the segments found above, as found by looking up the Genbank annotations.
Your output should conform to the template specified below.
You must turn in your results and your computer program, using this example file. Note that unlike prior assignments, this example file is not a correct solution, it only provides an formatting example. It also contains only 2 iterations and 2 genbank annotations - your solution should contain 10 and 5, respectively. Please put everything into ONE plain text file - do not send an archive of files or a tar file, or a word processing document file. Compress it (using either Unix compress, or gzip -- if you don't have access to either of these programs let us know), and send it as an attachment to both Phil (phg u.washington.edu) and Benjamin (bvernot gmail.com).

Template Details:

Transition Probabilities: The probability of transitioning from state 1 to state 1, 1 to 2, etc.
1,1=0.9900
1,2=0.0100
2,1=0.2000
2,2=0.8000
Emission Probabilities: The probability of emitting A in state 1, C in state 1, G in state 1, etc.
1,A=0.2500
1,C=0.2500
1,G=0.2500
Segment List: A list of segments with state 2 - start position to end position.
5 7
11 11
15 17