Genome 540 Homework Assignment 3

Due Sunday Feb. 2, 11:59pm

Write a program that identifies regions of elevated copy number with the D-segment algorithm described in lecture, where each base's score is determined by the number of read starts at that base. The program should:
- read in two input files
  - a mapped read count file (described below)
  - a 'scoring scheme' file that indicates the score for each read count (0, 1, 2, or >=3 reads, scoring scheme described below)
- output the following information (see template file for example):
  - number of normal and elevated copy-number segments
  - a list of elevated copy-number segments including their start position, end position, and score (rounded to two decimal places)
  - annotations for the first three segments (use the UCSC genome browser and put down anything interesting about the segment)
  - histograms of read-start counts (i.e. number of positions with 0, 1, 2, and >=3 read-starts) for a) normal segments, and b) elevated copy-number segments
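The two histograms can be tallied in one pass once the elevated segments are known. The following is a minimal sketch, not the required implementation: the function name `read_start_histograms` is hypothetical, `counts` is assumed to be a list of read-start counts for consecutive positions, segments are assumed to be 1-based inclusive index pairs into that list, and every position outside an elevated segment is assumed to count as normal. The template file remains the reference for the exact output layout.

```python
from collections import Counter


def read_start_histograms(counts, elevated_segments):
    """Tally read-start counts (capped at 3) for normal and elevated positions.

    counts: read-start count per position, in order (assumed contiguous).
    elevated_segments: (start, end) pairs, 1-based and inclusive (assumption).
    Returns (normal, elevated) Counters keyed by 0, 1, 2, 3, where 3 means >=3.
    """
    in_elevated = [False] * len(counts)
    for start, end in elevated_segments:
        for pos in range(start, end + 1):
            in_elevated[pos - 1] = True

    normal, elevated = Counter(), Counter()
    for count, flag in zip(counts, in_elevated):
        bucket = min(count, 3)  # collapse 3, 4, 5, ... into the ">=3" bin
        if flag:
            elevated[bucket] += 1
        else:
            normal[bucket] += 1
    return normal, elevated


if __name__ == "__main__":
    normal, elevated = read_start_histograms([0, 1, 5, 3, 2, 0, 0], [(2, 4)])
    for label, hist in (("normal", normal), ("elevated", elevated)):
        print(label, [hist[k] for k in range(4)])
```

With real data the positions come from the second column of the input file rather than from list indices, so the index arithmetic would need to be adapted.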
Run your program on this file using the following scoring scheme:
- score for 0 reads: -0.1282
- score for 1 read: 0.5649
- score for 2 reads: 1.2581
- score for >=3 reads: 1.9842
- D = -20
- S = -D = 20
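As a rough illustration of how these parameters fit together, the sketch below maps read counts to scores and runs a single-pass D-segment scan. The version of the algorithm given in lecture is authoritative; this follows one common formulation, and details such as tie handling, where the scan restarts after a segment ends, and the use of 1-based list indices instead of genomic positions are assumptions. The names `SCORES`, `score_for`, and `d_segments` are hypothetical.

```python
# Scores from the scoring scheme above; key 3 stands for ">=3 reads".
SCORES = {0: -0.1282, 1: 0.5649, 2: 1.2581, 3: 1.9842}


def score_for(read_count):
    """Look up the score for a read-start count, capping the count at 3."""
    return SCORES[min(read_count, 3)]


def d_segments(scores, D=-20.0, S=20.0):
    """Return (start, end, score) tuples, 1-based inclusive, score rounded to 2 places.

    A segment is closed when the cumulative score falls to 0 or below, drops
    by |D| or more from its running maximum, or the input ends. It is reported
    only if its maximum cumulative score is at least S.
    """
    segments = []
    cumul = best = 0.0
    start, end = 1, 0
    n = len(scores)
    for i, s in enumerate(scores, start=1):
        cumul += s
        if cumul >= best:
            best, end = cumul, i
        if cumul <= 0 or cumul <= best + D or i == n:
            if best >= S:
                segments.append((start, end, round(best, 2)))
            # Restart the scan at the next position (an assumption of this sketch).
            cumul = best = 0.0
            start = i + 1
    return segments


if __name__ == "__main__":
    counts = [0] * 5 + [3] * 15 + [0] * 200 + [1] * 40 + [0] * 3
    for seg in d_segments([score_for(c) for c in counts]):
        print(*seg)
```

Because D is negative here, the drop test is written as `cumul <= best + D`; if your lecture notes define D as a positive drop size, flip the sign accordingly.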
The input file has three columns: chromosome, position, and read-start count. The file was created from the start positions of all reads mapping to chromosome 16 for the individual CHM13. Sequencing was performed on the Illumina platform, and the reads were mapped to the human reference hg38 using BWA.

To create the likelihood scores, the number of read starts genome-wide was used to estimate the mean number of read starts per base. This mean was then used to define two Poisson distributions: one with the mean number of read starts per base, and the other with twice that mean. The first distribution captures the expected number of read starts, while the second should better reflect a region duplicated in this sample relative to hg38. The scoring scheme is the log-likelihood ratio of these two distributions (duplicated over expected) given the number of read starts observed.

The alignments are available here. The code to generate the input can be found here and here.
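The scoring scheme can be reproduced approximately from the two Poisson distributions described above. This is a sketch under stated assumptions, not the code that generated the assignment's scores: it assumes natural logarithms, a mean of about 0.1282 read starts per base (inferred from the listed score for 0 reads, which equals minus the mean under this model, and not stated in the assignment), and that the ">=3" score uses the ratio of the two tail probabilities. The function names are hypothetical.

```python
import math


def poisson_pmf(k, lam):
    """Probability of observing exactly k events under Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def llr_scores(lam, cap=3):
    """Log-likelihood ratios of Poisson(2*lam) vs Poisson(lam).

    Returns cap + 1 scores: one each for counts 0 .. cap-1, then one for
    counts >= cap, computed from the two tail probabilities.
    """
    scores = [
        math.log(poisson_pmf(k, 2 * lam) / poisson_pmf(k, lam))
        for k in range(cap)
    ]
    tail_normal = 1.0 - sum(poisson_pmf(k, lam) for k in range(cap))
    tail_duplicated = 1.0 - sum(poisson_pmf(k, 2 * lam) for k in range(cap))
    scores.append(math.log(tail_duplicated / tail_normal))
    return scores


if __name__ == "__main__":
    # 0.1282 is an inferred value, see the assumptions above.
    for label, score in zip(["0", "1", "2", ">=3"], llr_scores(0.1282)):
        print(f"{label}\t{score:.4f}")
```

For an exact count k the ratio simplifies to k·ln 2 minus the mean, which is why consecutive scores for 0, 1, and 2 reads differ by roughly ln 2 ≈ 0.693.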
Using this example input and this example scoring scheme your output file should look like this template. Use the same template structure for your output on the actual file.

Please put everything into ONE plain text file; do not send an archive of files, a tar file, or a word processing document. Compress it (using either Unix compress or gzip; if you don't have access to either of these programs, let us know), and send it as an attachment to both Phil (phg (at) u.washington.edu) and Mitchell (mvollger (at) uw.edu).