Genome 540 Homework Assignment 6
Due Sunday Feb 20
- Write a program that implements a 2-state HMM for detecting G+C rich regions in the Pyrococcus furiosus genome sequence. Conceptually, state 1 will correspond to the more frequent 'A+T rich' state, whereas state 2 will correspond to the less frequent G+C-rich state. Specifically:
- The starting parameter values (taken from Klein et al (2002), PNAS 99: 7542-47) should be as follows:
- Transition probabilities a_ij are a_11 = .999, a_12 = .001, a_21 = .01, a_22 = .99.
- Initiation probabilities for each state (i.e. the transition probabilities from the 'begin' state into state 1 or 2) should be .996 for state 1, and .004 for state 2; these should be held fixed throughout the Viterbi training
- Emission probabilities (which should also be held fixed) are
- e_A = e_T = .296, e_G = e_C = .204 for state 1;
- e_A = e_T = .159, e_G = e_C = .341 for state 2.
- Use Viterbi training as described in class to find improved parameter estimates for the transition probabilities, holding the emission and initiation probabilities fixed at the above values. Run the training for 10 iterations, where for each iteration you:
- Use dynamic programming to find the highest probability underlying state sequence.
- Using this state sequence, compute
- The number of states of each of the two types (1 and 2), and the number of segments of each type (where a segment consists of a contiguous series of states of the same type, that is preceded and followed by states of the opposite type or the beginning or end of the sequence).
- New transition probabilities to be used in the next iteration.
- Your output should provide
- the name and first line of the .fna file
- the information described above (in 2. -- i.e. numbers of states and segments, and new probability values), for each of the 10 iterations. Give probabilities to 3 decimal places only.
- the list of G+C-rich segments (corresponding to the segments having state 2 as the underlying state) after the final (10th) round of Viterbi training.
- your description of the first 10 of the segments found above, as found by looking up the Genbank annotations.
- You must turn in your results and your computer
program, using the template (file format) described below.
Please put everything into ONE plain text file - do not send an archive
of files or a tar file, or a word processing document file. Compress it (using either Unix compress, or gzip -- if
you don't have access to either of these programs let us know), and
send it as an attachment to both Phil at phg@u.washington.edu, and
Tobias at mann@gs.washington.edu.
Here is the template:
first line of the .fna file that you use for
viterbi training. NOTE: please do not change the filename or the first line in any way.
Please include all characters of the first line (including the '>'
character).
This result object should occur 10 times in your
homework, one for each iteration. It should give information
about states in the Viterbi (highest-probability) path, and
the hmm parameter estimates derived from the Viterbi state sequence,
and which are used in the next iteration, as indicated below.
State histograms should give, for each state,
the state label (1 or 2), followed by an equals
sign, followed by the number of positions in the
sequence having that state in the Viterbi parse. Put a comma between
the entries. For instance, if the sequence has length 8 and the
Viterbi parse is
11222111
then your histogram should be:
1=5,2=3
Segment histograms should give, for each state,
the state label, followed by an equals
sign, followed by the number of segments consisting
of that state. For instance, for the Viterbi parse above
your histogram would be:
1=2,2=1
The model object should specify an HMM by giving
state labels, initial state probabilities, state
transition probabilities, and symbol emission
probabilities.
give your state labels, separated by commas:
1,2
initial state probabilities should give,
for each state, the state label
and probability of starting in that state (i.e. the probability of transitioning into
that state from the begin state) separated by an equals
sign. Entries should be separated by commas.
1=0.900,2=0.100
transition probabilities should give, for each
state, the state label and probability
of transitioning to that state from the
state indicated in the attributes list.
The present field (with state='1') gives the probabilities
of transitioning from state 1 to states 1 and 2:
1=0.990,2=0.010
1=0.200,2=0.800
For each symbol emitted by the state
indicated in the attributes for this
field, give the probability
of emitting that symbol.
A=.250,C=.250,G=.250,T=.250
A=.300,C=.250,G=.250,T=.200
Give the state 2 segments found in the Viterbi path after the last (10th) iteration,
in the form (beginning of 1st segment,end of 1st seg), (beginning of 2d seg,end of 2d seg),...
For example, if the Viterbi path is 11112221112111222 you would have
(5,7),(11,11),(15,17)
Find the genbank annotations for the
10 segments closest to the beginning
of the genome sequence and report them as follows:
put the genbank annotion of the feature
here. You will need 10 of these objects
put comments about your code here
file contents here