Genome 540 Homework Assignment 8
Due Wednesday March 13, 11:59pm
- Write a program that implements an HMM for identifying
evolutionarily conserved segments from a multiple alignment of
human, dog and mouse genome sequences. Specifically:
- The HMM has two states: a "neutral", fast-evolving state (state 1)
and a "conserved", slow-evolving state (state 2).
- The emitted symbols for each state are multiple alignment columns
(e.g. "AAA" or "A-T"). Columns containing substitutions and
gaps have a higher probability of being emitted by the neutral state
than the conserved state.
- Define emission probabilities for each state as follows:
- Set neutral state (state 1) emission probabilities to
alignment column frequencies from
putative neutral sequences. To determine these frequencies
use this list of
alignment column counts from a large set of ancient repeat sequences.
In this file base1 is human, the base2 is dog, and base3 is mouse.
- Set conserved (state 2) emission probabilities to
alignment column frequencies from putative functional
sites. Use this list
of alignment column counts from a large number of 1st and
2nd codon positions. The base ordering in this file is
the same as for the ancient repeat counts described
above.
- Use the following transition probabilities:
- a11 = 0.95, a12 = 0.05
- a21 = 0.10, a22 = 0.90
- Initiation probabilities should be 0.95 for state 1 and 0.05 for state2.
- Use your HMM to determine the Viterbi parse for this multiple alignment of dog and mouse
to human ENCODE region ENm010 on chromosome 7.
Gaps have been removed from the human sequence
so that it is simple to determine the human genome coordinate of
each alignment column.
The alignment is broken into many small
blocks.
Within each block, the first line is the human sequence (labelled
hg18), the second line is the dog sequence (labelled canFam2) and the
third line is the mouse sequence (labelled mm9). 'N' characters
have been replaced with 'A's to avoid having to deal with
ambiguous bases.
- Your output should provide:
- The emission probabilities for the two states of your HMM reported to 5 decimal places (in
addition to the other parameters that were provided for you, i.e. initial probabilities and transition probabilities). Emission probabilities should be reported in alphabetical order (e.g. 1,A--; 1,A-A; 1,A-C;...; 1,TTT; 2,A--;...).
- Histograms describing the distribution of conserved/neutral
states and segments in the Viterbi parse.
- The coordinates of the 10 longest conserved (state 2)
segments from the Viterbi parse. Make your output coordinates
relative to the start of the chromosome by taking into account
the start position of the alignment (position 1,859 on
chromosome 16). This will make it possible to look up your
segments in a genome browser. For example, a segment starting at
the 10th alignment column would have the chromosomal start
coordinate 1,869).
- Give a brief annotation describing the genomic features that
overlap your 5 longest conserved segments (e.g. an
exon from a particular gene). You can do this by finding your
segments in the UCSC genome browser.
- As a test case, try your code on this truncated alignment of the ENm006 region.
The 10 longest conserved segments in this test case should be:
(152824343,152825071),
(152871156,152871765),
(152878440,152878808),
(152889710,152890061),
(152868885,152869199),
(152890189,152890483),
(152870540,152870811),
(152876661,152876913),
(152781311,152781545),
(152874832,152875057)
- You must turn in your results and your computer
program, using this example file .
Please note that the template file references the wrong chromosome start location, don't let this confuse you!
Please put everything into ONE plain text file - do not send an
archive of files or a tar file, or a word processing document
file. Compress it (using either Unix compress, or gzip -- if you
don't have access to either of these programs let us know), and send
it as an attachment to both Phil and Eliah.