Genome 540 Homework Assignment 5

Due Sunday Feb. 14, 11:59pm

  1. Write a program to identify regions of elevated copy number with the D-segment algorithm described in lecture, using the number of read starts at each base to determine that base's score. This program should:
  2. Run your program on this file using the following scoring scheme:
  3. The input file has three columns: chromosome, position, and read-start count. The file was created from the start positions of all reads mapping to chromosome 16 for the individual CHM13. Sequencing was performed on the Illumina platform, and the reads were mapped to the human reference hg38 using BWA. To create the likelihood scores, the number of read starts mapping to chromosome 16 was used to estimate the mean number, m, of read starts per base. This estimate was then used to define two Poisson distributions: one with mean m, and the other with mean 1.5 m. The first distribution should apply to read starts in 'background' (= homozygous reference), while the second should reflect a region heterozygously duplicated in the sample relative to the reference. The scoring scheme is the log likelihood ratio (log base 2) of these two distributions given the number of read starts observed. As discussed in lecture, 2^S should correspond (very approximately, and ignoring the constant K) to the expected average spacing between segments of score >= S in 'background'.
  4. Using this example input and this example scoring scheme, your output file should look like this template. Use the same template structure for your output on the actual file. Please put everything into ONE plain text file; do not send an archive of files, a tar file, or a word-processing document. Compress it (using either Unix compress or gzip; if you don't have access to either of these programs, let us know), and send it as an attachment to both Phil (phg (at) uw.edu) and Dani (dfaivre (at) uw.edu).
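
For reference, the log likelihood ratio described in item 3 can be written in closed form. The sketch below is only an illustration of where such scores come from, not the scoring scheme supplied with the assignment. It assumes the ratio is taken as duplicated over background, so that elevated counts score positively; the function name poisson_llr_score and the value of m in the usage lines are hypothetical. For a count k, the k! terms cancel and the score reduces to k * log2(1.5) - 0.5 * m / ln(2).

```python
import math


def poisson_llr_score(k, m, ratio=1.5):
    """Return log2[ Poisson(k; ratio*m) / Poisson(k; m) ].

    k     -- observed number of read starts at a base
    m     -- estimated mean read starts per base in background
    ratio -- mean of the duplicated model relative to background
             (1.5 for a heterozygous duplication, per the assignment text)

    Assumption: the ratio is duplicated / background.
    The k! terms cancel, leaving a function that is linear in k.
    """
    return k * math.log2(ratio) - (ratio - 1.0) * m / math.log(2)


if __name__ == "__main__":
    m = 1.0  # hypothetical mean, for illustration only
    for k in range(4):
        print(k, round(poisson_llr_score(k, m), 4))
```

Because the score is linear in k, low counts contribute a fixed negative penalty and each additional read start adds log2(1.5) to the score.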