- The starting parameter values (adapted from Klein et al. (2002), PNAS 99:7542-47) should be as follows:
- Transition probabilities a_ij are a_11 = .999, a_12 = .001, a_21 = .01, a_22 = .99
- Initiation probabilities for each state (i.e., the transition probabilities from the 'begin' state into state 1 or state 2) should be .996 for state 1 and .004 for state 2
- Emission probabilities are
- e_A = e_T = .30, e_G = e_C = .20 for state 1;
- e_A = e_T = .15, e_G = e_C = .35 for state 2.
- Use EM (Baum-Welch) training as described in class to find improved parameter estimates. Do not hold any parameter values fixed or set to be equivalent -- allow all of them to change with each iteration. (A sketch of the starting parameters and one training iteration appears after this list.)
- In each iteration, compute:
- the log-likelihood (to the base e, i.e., the natural log) of the sequence.
- new initiation probabilities to be used in the next iteration.
- new transition probabilities to be used in the next iteration.
- new emission probabilities to be used in the next iteration.
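
Below is a minimal sketch of the starting parameters and one training iteration, assuming Python/NumPy and an A, C, G, T integer encoding (the assignment does not prescribe a language or encoding). It uses the standard scaled forward-backward recursions, which is one reasonable way to compute the quantities listed above without underflow.

```python
import numpy as np

# Symbol encoding is an assumption; the assignment does not fix one.
SYMBOLS = {"A": 0, "C": 1, "G": 2, "T": 3}

# Starting parameters from the assignment (index 0 = state 1, index 1 = state 2).
init = np.array([0.996, 0.004])                 # initiation probabilities
trans = np.array([[0.999, 0.001],               # a_11, a_12
                  [0.010, 0.990]])              # a_21, a_22
emit = np.array([[0.30, 0.20, 0.20, 0.30],      # state 1: e_A, e_C, e_G, e_T
                 [0.15, 0.35, 0.35, 0.15]])     # state 2: e_A, e_C, e_G, e_T


def baum_welch_iteration(obs, init, trans, emit):
    """One EM iteration on an integer-encoded sequence `obs` (NumPy array).
    Returns the log-likelihood of obs under the *current* parameters, plus
    the new initiation, transition, and emission parameters for the next
    iteration."""
    T, K = len(obs), len(init)

    # Forward pass with per-position scaling so probabilities do not underflow.
    alpha = np.zeros((T, K))
    scale = np.zeros(T)
    alpha[0] = init * emit[:, obs[0]]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans) * emit[:, obs[t]]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]

    # Backward pass, scaled with the same factors.
    beta = np.zeros((T, K))
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = trans @ (emit[:, obs[t + 1]] * beta[t + 1]) / scale[t + 1]

    # Natural-log likelihood of the whole sequence: sum of log scaling factors.
    log_likelihood = np.log(scale).sum()

    # Posterior state probabilities (gamma) and expected transition counts (xi).
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)   # renormalize against round-off
    xi = np.zeros((K, K))
    for t in range(T - 1):
        xi += (alpha[t][:, None] * trans
               * emit[:, obs[t + 1]] * beta[t + 1]) / scale[t + 1]

    # Re-estimated parameters for the next iteration.
    new_init = gamma[0]
    new_trans = xi / xi.sum(axis=1, keepdims=True)
    new_emit = np.array([gamma[obs == s].sum(axis=0)
                         for s in range(emit.shape[1])]).T
    new_emit /= new_emit.sum(axis=1, keepdims=True)
    return log_likelihood, new_init, new_trans, new_emit
```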
Run the program until the increase in log-likelihood between successive iterations is less than 0.1. Check that the log-likelihood increases with each iteration -- if it doesn't, something is wrong with your program. My program took fewer than 250 iterations to converge.
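
One possible driver loop, under the same assumptions as the sketch above, is shown below; the `seq` value is only a placeholder and should be replaced with the sequence you are assigned to train on.

```python
# Placeholder input; substitute the actual sequence provided with the assignment.
seq = "ATGCATTACGGAGCTGCGCGCCGC" * 40
obs = np.array([SYMBOLS[c] for c in seq])

prev_ll = None
iteration = 0
while True:
    iteration += 1
    ll, init, trans, emit = baum_welch_iteration(obs, init, trans, emit)
    print(f"iteration {iteration}: log-likelihood = {ll:.4f}")
    if prev_ll is not None:
        if ll < prev_ll:            # sanity check: should never decrease
            raise RuntimeError("log-likelihood decreased; the update is buggy")
        if ll - prev_ll < 0.1:      # convergence criterion from above
            break
    prev_ll = ll
```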