Genome 540 Homework Assignment 8
Due Sunday, Mar. 7, 11:59pm
- Write a program that implements a 2-state HMM for detecting G+C-rich regions in Dehalobacter restrictus (Genbank file, GFF file).
Conceptually, state 1 corresponds to the A+T-rich state and state 2 to the G+C-rich state. Specifically:
- The starting parameter values (adapted from Klein et al. (2002), PNAS 99: 7542-47) should be as follows:
- Transition probabilities a_ij are a_11 = .999, a_12 = .001, a_21 = .01, a_22 = .99
- Initiation probabilities for each state (i.e. the transition probabilities from the 'begin' state into state 1 or 2) should be .996 for state 1, and .004 for state 2
- Emission probabilities are
- e_A = e_T = .30, e_G = e_C = .20 for state 1;
- e_A = e_T = .15, e_G = e_C = .35 for state 2.
- Use EM (Baum-Welch) training as described in class to find improved parameter estimates. Do not hold any parameter values fixed or constrain any of them to be equal -- allow all of them to change with each iteration.
- In each iteration, compute:
- the log-likelihood (to the base e, i.e., the natural log) of the sequence.
- new initiation probabilities to be used in the next iteration.
- new transition probabilities to be used in the next iteration.
- new emission probabilities to be used in the next iteration.
Run the program until the increase in log-likelihood between successive iterations becomes less than 0.1. You should check that the log-likelihood increases with each iteration -- if it doesn't, something is wrong with your program.
My program took less than 400 iterations to converge.
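The training loop described above can be sketched roughly as follows. This is a minimal illustration and not the reference solution: it assumes the sequence contains only A, C, G and T, that no explicit end state is modeled, and that the forward and backward values are rescaled at each position to avoid underflow (one common choice; the approach presented in class may differ). The names `baum_welch_step` and `train` are hypothetical.

```python
import math

ALPHABET = "ACGT"
N_STATES = 2

# Starting values copied from the assignment; index 0 is state 1 (A+T rich),
# index 1 is state 2 (G+C rich).
INIT = [0.996, 0.004]
TRANS = [[0.999, 0.001], [0.01, 0.99]]
EMIT = [
    {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
    {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
]


def baum_welch_step(seq, init, trans, emit):
    """One EM iteration: returns (log_likelihood, new_init, new_trans, new_emit)."""
    n, ks = len(seq), range(N_STATES)

    # Scaled forward pass: each row is normalized to sum to 1, and the
    # normalizers multiply to the sequence likelihood.
    fwd, scale = [], []
    for t, x in enumerate(seq):
        if t == 0:
            row = [init[k] * emit[k][x] for k in ks]
        else:
            row = [emit[k][x] * sum(fwd[t - 1][j] * trans[j][k] for j in ks)
                   for k in ks]
        s = sum(row)
        scale.append(s)
        fwd.append([v / s for v in row])
    loglik = sum(math.log(s) for s in scale)  # natural log

    # Scaled backward pass, reusing the forward normalizers.
    bwd = [[1.0] * N_STATES for _ in range(n)]
    for t in range(n - 2, -1, -1):
        x_next = seq[t + 1]
        bwd[t] = [sum(trans[k][j] * emit[j][x_next] * bwd[t + 1][j] for j in ks)
                  / scale[t + 1] for k in ks]

    # Expected transition and emission counts under the current parameters.
    trans_counts = [[0.0] * N_STATES for _ in ks]
    emit_counts = [{c: 0.0 for c in ALPHABET} for _ in ks]
    for t, x in enumerate(seq):
        for k in ks:
            emit_counts[k][x] += fwd[t][k] * bwd[t][k]
            if t + 1 < n:
                for j in ks:
                    trans_counts[k][j] += (fwd[t][k] * trans[k][j]
                                           * emit[j][seq[t + 1]]
                                           * bwd[t + 1][j] / scale[t + 1])

    # Re-estimate every parameter; nothing is held fixed or tied.
    new_init = [fwd[0][k] * bwd[0][k] for k in ks]
    new_trans = [[c / sum(row) for c in row] for row in trans_counts]
    new_emit = [{c: v / sum(d.values()) for c, v in d.items()}
                for d in emit_counts]
    return loglik, new_init, new_trans, new_emit


def train(seq, init, trans, emit, tol=0.1, max_iter=1000):
    """Iterate until the log-likelihood gain drops below tol.

    max_iter is only a safety cap (an assumption, not part of the assignment).
    When the loop stops on the tolerance, the returned log-likelihood is the
    one computed under the returned parameters.
    """
    prev = -math.inf
    for iteration in range(1, max_iter + 1):
        loglik, new_init, new_trans, new_emit = baum_welch_step(
            seq, init, trans, emit)
        if loglik - prev < tol:
            break
        prev = loglik
        init, trans, emit = new_init, new_trans, new_emit
    return iteration, loglik, init, trans, emit
```

A call such as `train(sequence, INIT, TRANS, EMIT)` would return the iteration count, the final log-likelihood, and the three parameter sets. How iterations are counted and which parameter set counts as "final" are choices made in this sketch. Plain Python lists are likely to be slow on a full bacterial genome, so an array-based version may be preferable in practice.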
- Your output should provide:
- the name and first line of the .fna file
- the number of iterations until convergence
- the final log-likelihood
- the final initiation, emission, and transition probabilities
-- please output these in scientific notation, to four significant digits (e.g., 9.000e-1)
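For the output items above, a small sketch of reading the first line of the sequence file and formatting values to four significant digits might look like this. The helper names `read_fasta` and `sci` are hypothetical, and the reader assumes a single-record FASTA file. Note that Python's `e` format pads the exponent to two digits, so 0.9 prints as `9.000e-01` rather than `9.000e-1`.

```python
def read_fasta(path):
    """Return (first_line, sequence) for a single-record FASTA file."""
    with open(path) as handle:
        # The first line is the '>' header that the output should report.
        header = handle.readline().rstrip("\n")
        # Remaining lines are sequence; join them and normalize to upper case.
        sequence = "".join(line.strip() for line in handle).upper()
    return header, sequence


def sci(value):
    """Scientific notation with four significant digits (three decimals)."""
    return f"{value:.3e}"


print(sci(0.9))    # 9.000e-01
print(sci(0.001))  # 1.000e-03
```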
- You must turn in your results and your computer program, using this template file.
Please put everything into ONE plain text file. Do not send an archive of files, a tar file, or a word-processing document. Compress the file using either Unix compress or gzip (if you don't have access to either of these programs, let us know), and send it as an attachment to both Phil (phg (at) uw.edu) and Dani (dfaivre (at) uw.edu).