Genome 540 Homework Assignment 8

Due Sun Mar 6

Write a program that implements an HMM for identifying evolutionarily conserved segments from a multiple alignment of human, dog and mouse genome sequences. Specifically:
- The HMM has two states: a "neutral", fast-evolving state (state 1) and a "conserved", slow-evolving state (state 2).
- The emitted symbols for each state are multiple alignment columns (e.g. "AAA" or "A-T"). Columns containing substitutions and gaps have a higher probability of being emitted by the neutral state than the conserved state.
- Define emission probabilities for each state as follows:
  - Set neutral state (state 1) emission probabilities to alignment column frequencies from putative neutral sequences. To determine these frequencies, use this list of alignment column counts from a large set of ancient repeat sequences. In this file, base1 is human, base2 is dog, and base3 is mouse.
  - Set conserved (state 2) emission probabilities to alignment column frequencies from putative functional sites. Use this list of alignment column counts from a large number of 1st and 2nd codon positions. The base ordering in this file is the same as for the ancient repeat counts described above.
- Use the following transition probabilities:
  - a₁₁ = 0.95, a₁₂ = 0.05
  - a₂₁ = 0.10, a₂₂ = 0.90
- Initiation probabilities should be 0.95 for state 1 and 0.05 for state 2.
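
The parameter setup above can be sketched as follows. This is a minimal illustration, not a reference solution: the helper name `emission_probs` and the toy counts are hypothetical, and the counts are assumed to have already been read into a dictionary keyed by the three-character alignment column (human, dog, mouse). Reading the count files is left out because their exact layout is not described here.

```python
def emission_probs(counts):
    """Normalize raw alignment-column counts into emission probabilities."""
    total = sum(counts.values())
    return {col: n / total for col, n in counts.items()}

# Toy counts for illustration only; the real values come from the
# ancient repeat and codon position count files.
toy_neutral_counts = {"AAA": 6, "AAT": 3, "A-T": 1}
toy_neutral = emission_probs(toy_neutral_counts)

# Transition and initiation probabilities as given in the assignment.
# Index 0 is state 1 (neutral), index 1 is state 2 (conserved).
TRANS = [[0.95, 0.05],
         [0.10, 0.90]]
INIT = [0.95, 0.05]

# Report emission probabilities to 5 decimal places.
for col, p in sorted(toy_neutral.items()):
    print(f"{col}={p:.5f}")
```

Any column that appears in the alignment but has a zero count in one of the files would get probability zero under this scheme, so it is worth checking for that case before taking logarithms.
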
Use your HMM to determine the Viterbi parse for this multiple alignment of dog and mouse to human ENCODE region ENm008 on chromosome 16. Gaps have been removed from the human sequence so that it is simple to determine the human genome coordinate of each alignment column. Also, two 'N's present in the human sequence have been changed to 'A's, so that you can avoid dealing with ambiguous bases. The alignment is broken into many small blocks, but the blocks are contiguous with respect to the human sequence. Within each block, the first line is the human sequence (labelled hg18), the second line is the dog sequence (labelled canFam2), and the third line is the mouse sequence (labelled mm9).
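
One way to compute a Viterbi parse is sketched below. It works in log space, since multiplying probabilities over a long alignment would underflow. The function name `viterbi`, its argument layout, and the toy emission tables are assumptions made for this sketch; the alignment is assumed to have already been converted to a list of three-character columns, and every column is assumed to have a nonzero emission probability in both states.

```python
import math

def viterbi(columns, init, trans, emit):
    """Return the most probable state path as a list of 0-based state indices.

    init[k]      initiation probability of state k
    trans[j][k]  probability of a transition from state j to state k
    emit[k][col] probability that state k emits alignment column col
    """
    n_states = len(init)
    # Log score of the best path ending in each state at the first column.
    score = [math.log(init[k]) + math.log(emit[k][columns[0]])
             for k in range(n_states)]
    backptr = []  # backptr[t][k]: best predecessor of state k at column t + 1
    for col in columns[1:]:
        new_score, ptr = [], []
        for k in range(n_states):
            best_j = max(range(n_states),
                         key=lambda j: score[j] + math.log(trans[j][k]))
            ptr.append(best_j)
            new_score.append(score[best_j] + math.log(trans[best_j][k])
                             + math.log(emit[k][col]))
        score = new_score
        backptr.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda k: score[k])
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

# Toy emission tables for illustration only (not derived from real counts).
toy_emit = [{"AAA": 0.2, "AAT": 0.8},   # state 1 (neutral)
            {"AAA": 0.9, "AAT": 0.1}]   # state 2 (conserved)
toy_trans = [[0.95, 0.05], [0.10, 0.90]]
toy_init = [0.95, 0.05]
toy_columns = ["AAT"] * 5 + ["AAA"] * 20 + ["AAT"] * 5
toy_path = viterbi(toy_columns, toy_init, toy_trans, toy_emit)
```

The back pointers take memory proportional to the alignment length times the number of states, which is modest for a two-state model.
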
Your output should provide:

- The emission probabilities for the two states of your HMM, reported to 5 decimal places (in addition to the other parameters that were provided for you).
- Histograms describing the distribution of conserved/neutral states and segments in the Viterbi parse.
- The coordinates of the 10 longest conserved (state 2) segments from the Viterbi parse. Make your output coordinates relative to the start of the chromosome by taking into account the start position of the alignment (position 32,668,237 on chromosome 21). This will make it possible to look up your segments in a genome browser. For example, a segment starting at the 10th alignment column would have the chromosomal start coordinate 32,668,246.
- A brief annotation describing the genomic features that overlap your 5 longest conserved segments (e.g. an exon from a particular gene). You can do this by finding your segments in the UCSC genome browser.
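
The segment and coordinate bookkeeping described above could look roughly like this. It is a sketch under stated assumptions: the state path is a list of 0-based state indices (1 meaning state 2, conserved), coordinates are reported as 1-based inclusive, and the helper names `segments` and `longest` are hypothetical. The "histograms" are interpreted here simply as counts of columns and of segments per state, which may not be the exact form the template expects.

```python
from collections import Counter
from itertools import groupby

# Chromosomal coordinate of the first alignment column, from the assignment.
ALIGN_START = 32_668_237

def segments(path, start=ALIGN_START):
    """Return (state, chrom_start, chrom_end) for each run of identical states."""
    out = []
    pos = 0  # 0-based index of the first column of the current run
    for state, run in groupby(path):
        length = sum(1 for _ in run)
        out.append((state, start + pos, start + pos + length - 1))
        pos += length
    return out

def longest(segs, state, n=10):
    """Return the n longest segments of the given state, longest first."""
    hits = [s for s in segs if s[0] == state]
    return sorted(hits, key=lambda s: s[2] - s[1], reverse=True)[:n]

# Toy path: a conserved segment that begins at the 10th alignment column.
toy_state_path = [0] * 9 + [1] * 3 + [0] * 2
toy_segs = segments(toy_state_path)

state_hist = Counter(toy_state_path)            # columns per state
segment_hist = Counter(s[0] for s in toy_segs)  # segments per state
top_conserved = longest(toy_segs, state=1)
```

In the toy path the conserved segment starts at the 10th column, so its start coordinate is 32,668,237 + 9 = 32,668,246, in agreement with the example given above.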

You must turn in your results and your computer program, using this template file. Please put everything into ONE plain text file. Do not send an archive of files, a tar file, or a word processing document file. Compress it (using either Unix compress or gzip; if you don't have access to either of these programs, let us know), and send it as an attachment to both Phil and Rupali. The XML file includes a DTD, which specifies the XML file format. Place the DTD at the beginning of your XML document. When you are done with your XML, check that it conforms to the DTD using this website, and resolve any errors before turning it in.