- The starting parameter values (adapted from Klein et al. (2002), PNAS 99:7542-47) should be as follows:
- Transition probabilities a_ij are a_11 = .999, a_12 = .001, a_21 = .01, a_22 = .99
- Initiation probabilities for each state (i.e., the transition probabilities from the 'begin' state into state 1 or state 2) should be .996 for state 1 and .004 for state 2
- Emission probabilities are
- e_A = e_T = .30, e_G = e_C = .20 for state 1;
- e_A = e_T = .15, e_G = e_C = .35 for state 2.
- Use EM (Baum-Welch) training as described in class to find improved parameter estimates. Do not hold any parameter values fixed or set to be equivalent -- allow all of them to change with each iteration. (A sketch of the starting parameters and one training iteration appears after this list.)
- In each iteration, compute:
- the log-likelihood (to the base e, i.e., the natural log) of the sequence.
- new initiation probabilities to be used in the next iteration.
- new transition probabilities to be used in the next iteration.
- new emission probabilities to be used in the next iteration.
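
Below is a minimal sketch of the starting parameters and one training iteration, assuming Python/NumPy and an A, C, G, T integer encoding (the assignment does not prescribe a language or encoding). It uses the standard scaled forward-backward recursions, which is one reasonable way to compute the quantities listed above without underflow.

```python
import numpy as np

# Symbol encoding is an assumption; the assignment does not fix one.
SYMBOLS = {"A": 0, "C": 1, "G": 2, "T": 3}

# Starting parameters from the assignment (index 0 = state 1, index 1 = state 2).
init = np.array([0.996, 0.004])                 # initiation probabilities
trans = np.array([[0.999, 0.001],               # a_11, a_12
                  [0.010, 0.990]])              # a_21, a_22
emit = np.array([[0.30, 0.20, 0.20, 0.30],      # state 1: e_A, e_C, e_G, e_T
                 [0.15, 0.35, 0.35, 0.15]])     # state 2: e_A, e_C, e_G, e_T


def baum_welch_iteration(obs, init, trans, emit):
    """One EM iteration on an integer-encoded sequence `obs` (NumPy array).
    Returns the log-likelihood of obs under the *current* parameters, plus
    the new initiation, transition, and emission parameters for the next
    iteration."""
    T, K = len(obs), len(init)

    # Forward pass with per-position scaling so probabilities do not underflow.
    alpha = np.zeros((T, K))
    scale = np.zeros(T)
    alpha[0] = init * emit[:, obs[0]]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans) * emit[:, obs[t]]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]

    # Backward pass, scaled with the same factors.
    beta = np.zeros((T, K))
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = trans @ (emit[:, obs[t + 1]] * beta[t + 1]) / scale[t + 1]

    # Natural-log likelihood of the whole sequence: sum of log scaling factors.
    log_likelihood = np.log(scale).sum()

    # Posterior state probabilities (gamma) and expected transition counts (xi).
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)   # renormalize against round-off
    xi = np.zeros((K, K))
    for t in range(T - 1):
        xi += (alpha[t][:, None] * trans
               * emit[:, obs[t + 1]] * beta[t + 1]) / scale[t + 1]

    # Re-estimated parameters for the next iteration.
    new_init = gamma[0]
    new_trans = xi / xi.sum(axis=1, keepdims=True)
    new_emit = np.array([gamma[obs == s].sum(axis=0)
                         for s in range(emit.shape[1])]).T
    new_emit /= new_emit.sum(axis=1, keepdims=True)
    return log_likelihood, new_init, new_trans, new_emit
```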
Run the program until the increase in log-likelihood between successive iterations is less than 0.1. Check that the log-likelihood increases with each iteration -- if it doesn't, something is wrong with your program. My program took fewer than 250 iterations to converge.
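
One possible driver loop, under the same assumptions as the sketch above, is shown below; the `seq` value is only a placeholder and should be replaced with the sequence you are assigned to train on.

```python
# Placeholder input; substitute the actual sequence provided with the assignment.
seq = "ATGCATTACGGAGCTGCGCGCCGC" * 40
obs = np.array([SYMBOLS[c] for c in seq])

prev_ll = None
iteration = 0
while True:
    iteration += 1
    ll, init, trans, emit = baum_welch_iteration(obs, init, trans, emit)
    print(f"iteration {iteration}: log-likelihood = {ll:.4f}")
    if prev_ll is not None:
        if ll < prev_ll:            # sanity check: should never decrease
            raise RuntimeError("log-likelihood decreased; the update is buggy")
        if ll - prev_ll < 0.1:      # convergence criterion from above
            break
    prev_ll = ll
```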