Genome 540 Homework Assignment 6

Due Sunday Feb 20

Write a program that implements a 2-state HMM for detecting G+C rich regions in the Pyrococcus furiosus genome sequence. Conceptually, state 1 will correspond to the more frequent 'A+T rich' state, whereas state 2 will correspond to the less frequent G+C-rich state. Specifically:
- The starting parameter values (taken from Klein et al (2002), PNAS 99: 7542-47) should be as follows:
  - Transition probabilities a_ij are a_11 = .999, a_12 = .001, a_21 = .01, a_22 = .99.
  - Initiation probabilities for each state (i.e. the transition probabilities from the 'begin' state into state 1 or 2) should be .996 for state 1, and .004 for state 2; these should be held fixed throughout the Viterbi training
  - Emission probabilities (which should also be held fixed) are
    - e_A = e_T = .296, e_G = e_C = .204 for state 1;
    - e_A = e_T = .159, e_G = e_C = .341 for state 2.
- Use Viterbi training as described in class to find improved parameter estimates for the transition probabilities, holding the emission and initiation probabilities fixed at the above values. Run the training for 10 iterations, where for each iteration you:
  1. Use dynamic programming to find the highest probability underlying state sequence.
  2. Using this state sequence, compute
    - The number of states of each of the two types (1 and 2), and the number of segments of each type (where a segment consists of a contiguous series of states of the same type, that is preceded and followed by states of the opposite type or the beginning or end of the sequence).
    - New transition probabilities to be used in the next iteration.
Your output should provide
- the name and first line of the .fna file
- the information described above (in 2. -- i.e. numbers of states and segments, and new probability values), for each of the 10 iterations. Give probabilities to 3 decimal places only.
- the list of G+C-rich segments (corresponding to the segments having state 2 as the underlying state) after the final (10th) round of Viterbi training.
- your description of the first 10 of the segments found above, as found by looking up the Genbank annotations.
You must turn in your results and your computer program, using the template (file format) described below. Please put everything into ONE plain text file - do not send an archive of files or a tar file, or a word processing document file. Compress it (using either Unix compress, or gzip -- if you don't have access to either of these programs let us know), and send it as an attachment to both Phil at phg@u.washington.edu, and Tobias at mann@gs.washington.edu.

Here is the template: <gs540_hw assignment='6' name='student name' email='student email'> <results> <result type='first line' file='filename'> first line of the .fna file that you use for viterbi training. NOTE: please do not change the filename or the first line in any way. Please include all characters of the first line (including the '>' character). </result> <result type='viterbi iteration' iteration='1'> This result object should occur 10 times in your homework, one for each iteration. It should give information about states in the Viterbi (highest-probability) path, and the hmm parameter estimates derived from the Viterbi state sequence, and which are used in the next iteration, as indicated below. <result type='state histogram'> State histograms should give, for each state, the state label (1 or 2), followed by an equals sign, followed by the number of positions in the sequence having that state in the Viterbi parse. Put a comma between the entries. For instance, if the sequence has length 8 and the Viterbi parse is 11222111 then your histogram should be: 1=5,2=3 </result> <result type='segment histogram'> Segment histograms should give, for each state, the state label, followed by an equals sign, followed by the number of segments consisting of that state. For instance, for the Viterbi parse above your histogram would be: 1=2,2=1 </result> <model type='hmm'> The model object should specify an HMM by giving state labels, initial state probabilities, state transition probabilities, and symbol emission probabilities. <states> give your state labels, separated by commas: 1,2 </states> <initial_state_probabilities> initial state probabilities should give, for each state, the state label and probability of starting in that state (i.e. the probability of transitioning into that state from the begin state) separated by an equals sign. Entries should be separated by commas. 1=0.900,2=0.100 </initial_state_probabilities> <transition_probabilities state='1'> transition probabilities should give, for each state, the state label and probability of transitioning to that state from the state indicated in the attributes list. The present field (with state='1') gives the probabilities of transitioning from state 1 to states 1 and 2: 1=0.990,2=0.010 </transition_probabilities> <transition_probabilities state='2'> 1=0.200,2=0.800 </transition_probabilities> <emission_probabilities state='1'> For each symbol emitted by the state indicated in the attributes for this field, give the probability of emitting that symbol. A=.250,C=.250,G=.250,T=.250 </emission_probabilities> <emission_probabilities state='2'> A=.300,C=.250,G=.250,T=.200 </emission_probabilities> </model> </result> <result type='segment list'> Give the state 2 segments found in the Viterbi path after the last (10th) iteration, in the form (beginning of 1st segment,end of 1st seg), (beginning of 2d seg,end of 2d seg),... For example, if the Viterbi path is 11112221112111222 you would have (5,7),(11,11),(15,17) </result> </results> <analysis> Find the genbank annotations for the 10 segments closest to the beginning of the genome sequence and report them as follows: <annotation segment='1'> put the genbank annotion of the feature here. You will need 10 of these objects </annotation> </analysis> <program> <comments> put comments about your code here </comments> <file> file contents here </file> </program> </gs540_hw>