Genome 540 Homework Assignment 8
Due Sunday Mar 6
- Write a program that implements an HMM for predicting genes in the genome sequence of Salmonella typhimurium (strain LT2). Specifically:
- The HMM has 11 states, numbered 1 through 11 (not counting the "begin" state 0). Each state emits observed symbols which are trinucleotides (rather than nucleotides); this is in order to allow us to model start and stop codons and the synonymous codon bias in coding sequences. Think of the original genome sequence, of length n, as being a sequence of n-2 (overlapping) trinucleotides: for example if the original sequence is AACGCT..., then the corresponding trinucleotides are AAC,ACG,CGC,GCT, ...
- State 1 corresponds to the first base of a start codon for a top-strand gene. It emits ATG, CTG, GTG, or TTG, but no other trinucleotides (i.e. the emission probability for every trinucleotide not of the form NTG is 0).
- State 2 corresponds to the first base of an internal codon for a top-strand gene. It emits any codon EXCEPT stop codons (i.e. TAA, TGA, and TAG have emission probabilities 0).
- State 3 corresponds to the second base of a start or internal codon for a top-strand gene. It can emit any trinucleotide (a trinucleotide emitted by state 3 corresponds to the 2d and 3d bases of one codon and the first base of the next codon).
- State 4 corresponds to the third base of a start or internal codon for a top-strand gene. It can emit any trinucleotide.
- State 5 corresponds to the first base of a stop codon. It emits only stop codons (i.e. any trinucleotide not of the form TAA, TGA, or TAG has emission probability 0).
- State 6 is the intergenic state. It can emit any trinucleotide.
- States 7 thru 11 are the bottom strand versions of states 1 thru 5 (so they emit complementary trinucleotides to those emitted by states 1 thru 5).
- The allowed transitions are as follows (you should make sure you understand why!):
- 1 transitions only to 3 (i.e. a1j = 0 unless j = 3)
- 2 transitions only to 3
- 3 transitions only to 4
- 4 transitions only to 2 or to 5
- 5 transitions only to 6
- 6 transitions only to 6, or to 1, or to 11
- 0 transitions only to 6 (i.e. the first base of the genome sequence is assumed intergenic)
- other transitions involving states 7 thru 11: you figure these out!
- Initial emission probabilities for each state should be 1/m, where m is the number of allowed trinucleotides for that state
- Initial transition probabilities:
- a42 = .99, a45 = .01
- a66 = .8, a61 = .1, a6,11 = .1
- other cases determined by symmetry
- Use Viterbi training as described in class to find improved parameter estimates. Run the training for 5 iterations, where for each iteration you:
- Use dynamic programming to find the highest probability underlying state sequence.
- Using this state sequence, compute
- The numbers of top strand and bottom strand genes.
- New transition probabilities and new emission probabilities, to be used in the next iteration.
Your output should provide
- the name and first line of the .fna file
- For each of the 5 iterations, the numbers of top strand and bottom strand genes given by the Viterbi parse, and the new transition and emission probabilities. Give probabilities to 5 decimal places. List all of probabilities, even when they are 0.
You must turn in your results and your computer
program, using the template (file format) described below.
Please put everything into ONE plain text file - do not send an archive
of files or a tar file, or a word processing document file. Compress it (using either Unix compress, or gzip -- if
you don't have access to either of these programs let us know), and
send it as an attachment to both Phil at phg@u.washington.edu, and
Tobias at mann@gs.washington.edu.
Here is the template.
first line of the .fna file that you use for
viterbi training. NOTE: please do not change the filename or
the first line in any way.
Please include all characters of the first line (including the '>'
character).
The gene histogram should give the number of top strand
and bottom strand genes found at this iteration, and
should be reported as below.
top strand genes=,bottom strand genes=
The model object should specify an HMM by giving
state labels, initial state probabilities (= transition probabilities from the begin state), state
transition probabilities, and symbol emission
probabilities.
give your state labels, separated by commas:
1,2
initial state probabilities should give,
for each state, the state label
and probability of starting in that state (i.e. the
probability of transitioning into
that state from the begin state) separated by an equals
sign. Entries should be separated by commas.
1=0.90000,2=0.10000
transition probabilities should give, for each
state, the state label and probability
of transitioning to that state from the
state indicated in the attributes list.
The present field (with state='1') gives the probabilities
of transitioning from state 1 to states 1 and 2:
1=0.99000,2=0.01000
1=0.20000,2=0.80000
For each symbol emitted by the state
indicated in the attributes for this
field, give the probability
of emitting that symbol.
TTT=.15000,TCT=.15000,TAT=.15000,TGT=.15000,
...
TTT=.20000,TCT=.15000,TAT=.15000,TGT=.10000,
...
put comments about your code here
file contents here