Genome 540 Homework Assignment 8
Due Sun Mar 6
- Write a program that implements an HMM for identifying
evolutionarily conserved segments from a multiple alignment of
human, dog and mouse genome sequences. Specifically:
- The HMM has two states: a "neutral", fast-evolving state (state 1)
and a "conserved", slow-evolving state (state 2).
- The emitted symbols for each state are multiple alignment columns
(e.g. "AAA" or "A-T"). Columns containing substitutions and
gaps have a higher probability of being emitted by the neutral state
than the conserved state.
- Define emission probabilities for each state as follows:
- Set neutral state (state 1) emission probabilities to
alignment column frequencies from
putative neutral sequences. To determine these frequencies
use this list of
alignment column counts from a large set of ancient repeat sequences.
In this file base1 is human, the base2 is dog, and base3 is mouse.
- Set conserved (state 2) emission probabilities to
alignment column frequencies from putative functional
sites. Use this list
of alignment column counts from a large number of 1st and
2nd codon positions. The base ordering in this file is
the same as for the ancient repeat counts described
above.
- Use the following transition probabilities:
- a11 = 0.95, a12 = 0.05
- a21 = 0.10, a22 = 0.90
- Initiation probabilities should be 0.95 for state 1 and 0.05
for state2.
- Use your HMM to determine the viterbi parse for this multiple alignment of dog and mouse
to human ENCODE
region ENm008
on chromosome 16. Gaps have been removed from the human sequence
so that it is simple to determine the human genome coordinate of
each alignment column. Also, two 'N's present in the human sequence have been changed to 'A's, so avoid dealing with ambiguous bases.
The alignment is broken into many small
blocks but the blocks are contiguous with respect to the human
sequence. Within each block the first line is the human sequence
(labelled hg18), the second line is the dog sequence (labelled
canFam2) and the third line is the mouse sequence (labelled mm9).
- Your output should provide:
- The emission probabilities for the two states of your HMM reported to 5 decimal places (in
addition to the other parameters that were provided for you).
- Histograms describing the distribution of conserved/neutral
states and segments in the Viterbi parse.
- The coordinates of the 10 longest conserved (state 2)
segments from the Viterbi parse. Make your output coordinates
relative to the start of the chromosome by taking into account
the start position of the alignment (position 32,668,237 on
chromosome 21). This will make it possible to look up your
segments in a genome browser. For example, a segment starting at
the 10th alignment column would have the chromosomal start
coordinate 32,668,246).
- Give a brief annotation describing the genomic features that
overlap your 5 longest conserved segments (e.g. an
exon from a particular gene). You can do this by finding your
segments in the UCSC genome browser.
- You must turn in your results and your computer
program, using this template file .
Please put everything into ONE plain text file - do not send an
archive of files or a tar file, or a word processing document
file. Compress it (using either Unix compress, or gzip -- if you
don't have access to either of these programs let us know), and send
it as an attachment to both Phil and Rupali.
(The XML file includes a DTD, which specifies the XML file
format. Place the DTD at the beginning of your XML document. When you
are done with your XML check to make sure it conforms to the DTD using
this website and
resolve any errors before turning it in.