Genome 540 Homework Assignment 8

Due Sun March 6

Write a program (based on the 'D-segment' pseudocode provided in lecture) to identify regions of elevated copy-number as 'D-segments', using the number of read-starts observed at every base.
Allow user settable parameter values
- N = Expected length of normal copy number region (state 1)
- E = Expected length of elevated copy number region (state 2)
Use these values to set the transition probabilities: a₁₁, a₁₂, a₂₁, a₂₂.
- a₁₂ = 1/N,
- a₂₁ = 1/E,
- a₁₁ and a₂₂ will follow from these.
Observed symbols are counts of # read-starts at each position (r = 0,1,2, >=3). The emission probabilities (e₁(r), e₂(r)) for each symbol r in states 1 and 2 will be given by Poisson distribution with mean=m₁ for state1 and mean=m₂ for state 2. The means will be set based on the read coverage of input region. Note that counts 3 or greater are all lumped together in one category (represented by symbol ">=3") and its associated probability value should be calculated accordingly.
The scoring scheme should be as described during lecture:
- The score contibuted by each base along the sequence will be s(r) = log₂((e₂(r) × a₂₂)/(e₁(r) × a₁₁))
- S = -D = log₂((a₁₁ × a₂₂)/(a₁₂ × a₂₁))
Your program should output:
- a histogram of state counts
- a histogram of segment counts
- the transition and emission probabilities
- threshold S
- a list of the segments (inclusive base-1 starting & ending positions, and scores - i.e., if these are the first ten states 1112222111: start=4, end=7)
- annotations for the first three segments (use the UCSC genome browser and just tell me if anything interesting is in the segment)
- a histogram of read-start counts (i.e. number of positions with 0, 1, 2 and >=3 read-starts)
- a histogram of read-start counts (i.e. number of positions with 0, 1, 2 and >=3 read-starts), only for positions in state 2
Run the program on this file, using parameter values
- N = 1MB.
- E = 10KB.
- Mean of Poisson distribution for state 1 (m₁) = 0.074 (The input was covered at 14x with 188bp reads. Thus we expect a read start on an average every 14/188 = 0.074 bases)
- Mean of Poisson distribution for state 2 (m₂) = 0.11 (= 1.5 × 0.074)
The input file has three columns: chromosome, position, and # of read-starts. The file was created based on the start positions of all reads mapping to chromosome 15 for individual NA12878, sequenced as a part of the 1000 genomes project. Sequencing was performed on the Illumina platform, and the reads were mapped to the human reference hg19 using BWA. The alignments are available on the 1000 genomes project website in BAM format.
You must turn in your results and your computer program, using this template file. Please put everything into ONE plain text file - do not send an archive of files or a tar file, or a word processing document file. Compress it (using either Unix compress, or gzip -- if you don't have access to either of these programs let us know), and send it as an attachment to both Phil and Anne.