Genome 540 Homework Assignment 8

Genome 540 Homework Assignment 7

Due Sun Mar 15

Write a program that implements an HMM for identifying evolutionarily conserved segments from a multiple alignment of human, dog and mouse genome sequences. Specifically:
- The HMM has two states: a "neutral", fast-evolving state (state 1) and a "conserved", slow-evolving state (state 2).
- The emitted symbols for each state are multiple alignment columns (e.g. "AAA" or "A-T"). Columns containing substitutions and gaps have a higher probability of being emitted by the neutral state than the conserved state.
- Define emission probabilities for each state as follows:
  - Set neutral state (state 1) emission probabilities to alignment column frequencies from putative neutral sequences. To determine these frequencies, use this list of alignment column counts from a large set of ancient repeat sequences. In this file, base1 is human, base2 is dog, and base3 is mouse.
  - Set conserved (state 2) emission probabilities to alignment column frequencies from putative functional sites. Use this list of alignment column counts from a large number of 1st and 2nd codon positions. The base ordering in this file is the same as for the ancient repeat counts described above.
- Use the following transition probabilities:
  - a₁₁ = 0.95, a₁₂ = 0.05
  - a₂₁ = 0.10, a₂₂ = 0.90
- Initiation probabilities should be 0.95 for state 1 and 0.05 for state 2.
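As an illustration of the emission probability definition above, here is a minimal sketch that turns alignment column counts into frequencies and prints them to 5 decimal places. The counts are invented toy values, the function name `counts_to_probs` is hypothetical, and reading the real count files is left out because their exact layout is not described here.

```python
def counts_to_probs(counts):
    """Normalize alignment column counts into relative frequencies."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("count table is empty")
    return {col: n / total for col, n in counts.items()}

# Toy counts, invented for illustration only (not taken from the real files).
# Keys are alignment columns in the order human, dog, mouse.
toy_counts = {"AAA": 6, "AAG": 3, "A-T": 1}
toy_probs = counts_to_probs(toy_counts)

for col, p in sorted(toy_probs.items()):
    print(f"{col}\t{p:.5f}")
```

Note that a column absent from a count table would get probability 0 under this simple normalization; how to treat such columns is a decision the sketch does not make.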
Use your HMM to determine the Viterbi parse for this multiple alignment of dog and mouse to human ENCODE region ENm001 on chromosome 7. Gaps have been removed from the human sequence so that it is simple to determine the human genome coordinate of each alignment column. The alignment is broken into many small blocks, but the blocks are contiguous with respect to the human sequence. Within each block, the first line is the human sequence (labelled hg18), the second line is the dog sequence (labelled canFam2), and the third line is the mouse sequence (labelled mm9).
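The Viterbi parse can be sketched as follows. This is a minimal log-space version under stated assumptions, not a reference solution: states are indexed 0 (neutral) and 1 (conserved), the alignment has already been read into a list of column strings, and the emission tables `toy_emit` are invented toy values. Working with log probabilities avoids numerical underflow on a long alignment.

```python
import math

def log(p):
    """Natural log that maps probability 0 to -infinity instead of raising."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi(columns, init, trans, emit):
    """Return the most probable state path (list of state indices) for columns."""
    if not columns:
        return []
    n = len(init)
    # score[s] = best log probability of any path that ends in state s
    score = [log(init[s]) + log(emit[s].get(columns[0], 0.0)) for s in range(n)]
    back = []  # back[t][s] = best predecessor of state s at column t + 1
    for col in columns[1:]:
        new_score, ptr = [], []
        for s in range(n):
            best = max(range(n), key=lambda r: score[r] + log(trans[r][s]))
            ptr.append(best)
            new_score.append(score[best] + log(trans[best][s])
                             + log(emit[s].get(col, 0.0)))
        score = new_score
        back.append(ptr)
    # Trace back from the best final state
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

# Parameters given in the assignment (index 0 = state 1, index 1 = state 2)
init = [0.95, 0.05]
trans = [[0.95, 0.05], [0.10, 0.90]]
# Toy emission tables, invented for illustration only
toy_emit = [{"AAA": 0.2, "A-T": 0.8}, {"AAA": 0.9, "A-T": 0.1}]

print(viterbi(["A-T"] * 3 + ["AAA"] * 20, init, trans, toy_emit))
```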
Your output should provide:

- The emission probabilities for the two states of your HMM, reported to 5 decimal places (in addition to the other parameters that were provided for you).
- Histograms describing the distribution of conserved/neutral states and segments in the Viterbi parse.
- The coordinates of the 10 longest conserved (state 2) segments from the Viterbi parse. Make your output coordinates relative to the start of the chromosome by taking into account the start position of the alignment (position 115,597,756 on chromosome 7). This will make it possible to look up your segments in a genome browser. For example, a segment starting at the 10th alignment column would have the chromosomal start coordinate 115,597,765.
- A brief annotation describing the genomic features that overlap your 5 longest conserved segments (e.g. an exon from a particular gene). You can do this by finding your segments in the UCSC genome browser.
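The coordinate conversion and segment extraction described above can be sketched like this. It assumes the Viterbi parse is a list of state indices (0 = neutral, 1 = conserved), one per alignment column; the function name `find_segments` and the toy path are hypothetical. Coordinates are 1-based and inclusive, so the first alignment column maps to 115,597,756.

```python
from collections import Counter
from itertools import groupby

ALIGN_START = 115_597_756  # chromosome 7 position of the first alignment column

def find_segments(path, state, offset=ALIGN_START):
    """Return (start, end, length) for each maximal run of `state` in path."""
    out = []
    pos = 0  # 0-based index of the current alignment column
    for s, run in groupby(path):
        length = sum(1 for _ in run)
        if s == state:
            out.append((offset + pos, offset + pos + length - 1, length))
        pos += length
    return out

# Toy parse, invented for illustration: a conserved run starts at the 10th column
toy_path = [0] * 9 + [1] * 3 + [0] * 2 + [1] * 5

conserved = find_segments(toy_path, 1)
longest = sorted(conserved, key=lambda seg: seg[2], reverse=True)[:10]
state_hist = Counter(toy_path)  # number of columns assigned to each state

for start, end, length in longest:
    print(f"{start:,}\t{end:,}\t{length}")
```

The first conserved run in the toy parse starts at the 10th column, so its start coordinate comes out as 115,597,765, matching the example in the text.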

You must turn in your results and your computer program, using this template file. Please put everything into ONE plain text file - do not send an archive of files or a tar file, or a word processing document file. Compress it (using either Unix compress or gzip; if you don't have access to either of these programs, let us know), and send it as an attachment to both Phil and Graham. The XML file includes a DTD, which specifies the XML file format. Place the DTD at the beginning of your XML document. When you are done with your XML, check that it conforms to the DTD using this website, and resolve any errors before turning it in.