Genome 540 Homework Assignment 5

Due Sunday Feb 13

Write a program that implements a 2-state HMM for detecting G+C rich regions in the Thermocrinis albus DSM 14484 genome sequence you analyzed in assignment 2. Conceptually, state 1 will correspond to the more frequent 'A+T rich' state, whereas state 2 will correspond to the less frequent G+C-rich state. Specifically:
- The starting parameter values (taken from Klein et al (2002), PNAS 99: 7542-47) should be as follows:
  - Transition probabilities a_ij are a_11 = .999, a_12 = .001, a_21 = .01, a_22 = .99.
  - Initiation probabilities for each state (i.e. the transition probabilities from the 'begin' state into state 1 or 2) should be .996 for state 1, and .004 for state 2; these should be held fixed throughout the Viterbi training
  - Emission probabilities (which should also be held fixed) are
    - e_A = e_T = .291, e_G = e_C = .209 for state 1;
    - e_A = e_T = .169, e_G = e_C = .331 for state 2.
- Use Viterbi training as described in class to find improved parameter estimates for the transition probabilities, holding the emission and initiation probabilities fixed at the above values. Run the training for 10 iterations, where for each iteration you:
  1. Use dynamic programming to find the highest probability underlying state sequence.
  2. Using this state sequence, compute
    - The number of states of each of the two types (1 and 2), and the number of segments of each type (where a segment consists of a contiguous series of states of the same type, that is preceded and followed by states of the opposite type or the beginning or end of the sequence).
    - New transition probabilities to be used in the next iteration.
Your output should provide
- the name and first line of the .fna file
- the information described above (in 2. -- i.e. numbers of states and segments, and new probability values), for each of the 10 iterations. Give probabilities to 4 decimal places only.
- the list of G+C-rich segments (corresponding to the segments having state 2 as the underlying state) after the final (10th) round of Viterbi training.
- your description of the first 5 of the segments found above, as found by looking up the Genbank annotations.
Your output should conform to the template specified below.
You must turn in your results and your computer program, using this template file . Please put everything into ONE plain text file - do not send an archive of files or a tar file, or a word processing document file. Compress it (using either Unix compress, or gzip -- if you don't have access to either of these programs let us know), and send it as an attachment to both Phil and Rupali. (The XML file includes a DTD, which specifies the XML file format. Place the DTD at the beginning of your XML document. When you are done with your XML check to make sure it conforms to the DTD using this website and resolve any errors before turning it in.