Genome 540 Homework Assignment 9

(Winter Quarter 2024)

Due Sunday Mar 10, 11:59pm

Write a program that implements an HMM for identifying evolutionarily conserved segments from a multiple alignment of human, dog and mouse genome sequences. Specifically:
- The HMM has two states: a "neutral", fast-evolving state (state 1) and a "conserved", slow-evolving state (state 2).
- The emitted symbols for each state are multiple alignment columns (e.g. "AAA" or "A-T"). Columns containing substitutions and gaps have a higher probability of being emitted by the neutral state than the conserved state.
- Define emission probabilities for each state as follows:
  - Set neutral state (state 1) emission probabilities to alignment column frequencies from putative neutral sequences. To determine these frequencies use this list of alignment column counts from a large set of ancient repeat sequences. In this file base1 is human, the base2 is dog, and base3 is mouse.
  - Set conserved (state 2) emission probabilities to alignment column frequencies from putative functional sites. Use this list of alignment column counts from a large number of 1st and 2nd codon positions. The base ordering in this file is the same as for the ancient repeat counts described above.
- Use the following transition probabilities:
  - a₁₁ = 0.95, a₁₂ = 0.05
  - a₂₁ = 0.10, a₂₂ = 0.90
- Initiation probabilities should be 0.95 for state 1 and 0.05 for state2.
Use your HMM to determine the Viterbi parse for this multiple alignment of dog and mouse to human ENCODE region ENm010 on chromosome 7. Gaps have been removed from the human sequence so that it is simple to determine the human genome coordinate of each alignment column. The alignment is broken into many small blocks. Within each block, the first line is the human sequence (labelled hg18), the second line is the dog sequence (labelled canFam2) and the third line is the mouse sequence (labelled mm9). 'N' characters have been replaced with 'A's to avoid having to deal with ambiguous bases.
Your output should provide:
- The emission probabilities for the two states of your HMM reported to 5 decimal places (in addition to the other parameters that were provided for you, i.e. initial probabilities and transition probabilities). Emission probabilities should be reported in alphabetical order (e.g. 1,A--; 1,A-A; 1,A-C;...; 1,TTT; 2,A--;...).
- Histograms describing the distribution of conserved/neutral states and segments in the Viterbi parse.
- The coordinates of the 10 longest conserved (state 2) segments from the Viterbi parse. Make your output coordinates relative to the start of the chromosome by taking into account the start position of the alignment (position 26,924,045 on chromosome 7). This will make it possible to look up your segments in a genome browser. For example, a segment starting at the 10th alignment column would have the chromosomal start coordinate 26,924,055).
- Give a brief annotation describing the genomic features that overlap your 5 longest conserved segments (e.g. an exon from a particular gene). You can do this by finding your segments in the UCSC genome browser.
As a test case, try your code on this truncated alignment of the ENm006 region. The 10 longest conserved segments in this test case should be: (152824343,152825071), (152871156,152871765), (152878440,152878808), (152889710,152890061), (152868885,152869199), (152890189,152890483), (152870540,152870811), (152876661,152876913), (152781311,152781545), (152874832,152875057)
You must turn in your results and your computer program, using this template file. Please put everything into ONE file - do not send an archive of files or a tar file. After creating a plain text file (NOT a word processing document file) in this format, compress it (using either Unix compress, or gzip -- if you don't have access to either of these programs, let us know), and send it as an attachment to both Phil at phg@uw.edu and Cliff at crostomily@gmail.com.