Genome 540 Homework Assignment 7
Due Thu Mar 5
- Write a program that implements an HMM for predicting genes in the genome sequence of Shigella dysenteriae (NC_007606) (used in HW 2)
Specifically:
- The HMM has 11 states, numbered 1 through 11 (not counting the "begin" state 0). Each state emits observed symbols which are trinucleotides (rather than nucleotides); this is in order to allow us to model start and stop codons and the synonymous codon bias in coding sequences. Think of the original genome sequence, of length n, as being a sequence of n-2 (overlapping) trinucleotides: for example if the original sequence is AACGCT..., then the corresponding trinucleotides are AAC,ACG,CGC,GCT, ...
- State 1 corresponds to the first base of a start codon for a top-strand gene. It emits ATG, CTG, GTG, or TTG, but no other trinucleotides (i.e. the emission probability for every trinucleotide not of the form NTG is 0).
- State 2 corresponds to the first base of an internal codon for a top-strand gene. It emits any codon EXCEPT stop codons (i.e. TAA, TGA, and TAG have emission probabilities 0).
- State 3 corresponds to the second base of a start or internal codon for a top-strand gene. It can emit any trinucleotide (a trinucleotide emitted by state 3 corresponds to the 2d and 3d bases of one codon and the first base of the next codon).
- State 4 corresponds to the third base of a start or internal codon for a top-strand gene. It can emit any trinucleotide.
- State 5 corresponds to the first base of a stop codon. It emits only stop codons (i.e. any trinucleotide not of the form TAA, TGA, or TAG has emission probability 0).
- State 6 is the intergenic state. It can emit any trinucleotide.
- States 7 thru 11 are the bottom strand versions of states 1 thru 5 (so they emit complementary trinucleotides to those emitted by states 1 thru 5).
- The allowed transitions are as follows (you should make sure you understand why!):
- 1 transitions only to 3 (i.e. a1j = 0 unless j = 3)
- 2 transitions only to 3
- 3 transitions only to 4
- 4 transitions only to 2 or to 5
- 5 transitions only to 6
- 6 transitions only to 6, or to 1, or to 11
- 0 transitions only to 6 (i.e. the first base of the genome sequence is assumed intergenic)
- other transitions involving states 7 thru 11: you figure these out!
- Initial emission probabilities for each state should be 1/m, where m is the number of allowed trinucleotides for that state
- Initial transition probabilities:
- a42 = .99, a45 = .01
- a66 = .8, a61 = .1, a6,11 = .1
- other cases determined by symmetry
- Use Viterbi training as described in class to find improved parameter estimates. Run the training for 5 iterations, where for each iteration you:
- Use dynamic programming to find the highest probability underlying state sequence.
- Using this state sequence, compute
- The numbers of top strand and bottom strand genes.
- New transition probabilities and new emission probabilities, to be used in the next iteration.
Your output should provide
- the name and first line of the .fna file
- For each of the 5 iterations, the numbers of top strand and bottom strand genes given by the Viterbi parse, and the new transition and emission probabilities. Give probabilities to 5 decimal places. List all of the probabilities, even when they are 0.
- Give a brief annotation for the first 3 of the predicted genes found above by looking up the Genbank annotations.
You must turn in your results and your computer
program, using this template file .
Please put everything into ONE plain text file - do not send an archive
of files or a tar file, or a word processing document file. Compress it (using either Unix compress, or gzip -- if
you don't have access to either of these programs let us know), and
send it as an attachment to both Phil and Graham.
(The XML file includes a DTD, which specifies the XML file format. Place the DTD at the beginning of your XML document. When you are done with your XML check to make
sure it conforms to the DTD using this website and resolve any errors before turning it in.