Genome 540 Homework Assignment 6

Due Sunday Feb. 21, 11:59pm

In this assignment, you will create a randomized data set corresponding to the data set from HW5 and run the maximal D-segment algorithm on both of them.
- First, write a program that simulates a sequence of read start counts, using the following count distribution and total number of chromosome sites (this corresponds to the original data for HW 5, modified to exclude N's, because read alignments cannot start at an 'N'):
  - 0 read starts: 71,769,525 sites
  - 1 read start: 9,215,365 sites
  - 2 read starts: 735,282 sites
  - >=3 read starts: 85,673 sites
  - total number of sites: 81,805,845
  The distribution of read start counts in your simulated sequence should be close to this. The following pseudocode demonstrates one approach to creating this sequence:
```
N = total number of sites
counts[r] = number of sites with r read starts in (modified) original sequence
for each site 1...N
    x = random number between 0 and 1 (uniform distribution)
    if x < counts[0] / N
        randomized_counts[site] = 0
    else if x < (counts[0] + counts[1]) / N
        randomized_counts[site] = 1 
    else if x < (counts[0] + counts[1] + counts[2]) / N
        randomized_counts[site] = 2
    else 
        randomized_counts[site] = 3
```
  This randomization tends to eliminate the clustering of read starts due to copy number variation. Note, however, that we are still preserving the distribution of read start counts. As a result, this approach is expected to be more conservative than just randomly locating read starts across the sequence. We do it this way because factors other than CNVs, such as library amplification, can also cause clustering of read starts at a particular site.
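  A minimal Python sketch of the pseudocode above, using only the standard library. The function name `simulate_read_starts` and the `seed` parameter are illustrative assumptions, not part of the assignment; the value 3 stands for ">=3 read starts".
```python
import random
from bisect import bisect_right
from itertools import accumulate

# Sites with 0, 1, 2 and >=3 read starts, taken from the list above.
COUNTS = (71_769_525, 9_215_365, 735_282, 85_673)


def simulate_read_starts(n_sites, counts=COUNTS, seed=None):
    """Draw one read start count (0..3) per site from the given distribution.

    Hypothetical helper: the name and signature are for illustration only.
    """
    rng = random.Random(seed)
    total = sum(counts)
    # Cumulative probabilities: P(<=0), P(<=1), P(<=2), P(<=3).
    cumulative = [c / total for c in accumulate(counts)]
    cumulative[-1] = 1.0  # guard against floating-point round-off
    simulated = bytearray(n_sites)  # one byte per site keeps memory low
    for site in range(n_sites):
        x = rng.random()  # uniform in [0, 1)
        # Index of the first cumulative probability strictly greater than x,
        # which is the same test as the if/else chain in the pseudocode.
        simulated[site] = bisect_right(cumulative, x)
    return simulated


if __name__ == "__main__":
    sample = simulate_read_starts(1000, seed=540)
    print(sum(1 for v in sample if v == 0), "of", len(sample), "sites have 0 read starts")
```
  A pure-Python loop over all 81,805,845 sites will be slow; drawing in bulk with a numerical library is a reasonable alternative if one is available to you.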
- Run your maximal D-segment algorithm on your simulated count sequence with D = -10. Identify segments using score thresholds S = 10 to 30 (i.e., 10, 11, 12, ..., 28, 29, 30). Use the following scoring scheme (LLR scores with log base 2):
  - score for 0 reads: -0.1077
  - score for 1 read: 0.4772
  - score for 2 reads: 1.0622
  - score for >=3 reads: 1.6748
- Run your maximal D-segment algorithm on the 'real data' sequence of read starts used in assignment 5 with the above D value, scoring scheme, and score thresholds.
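  For reference, here is a sketch of one common single-pass formulation of a maximal D-segment scan, combined with the scoring scheme above. It is an assumption that this matches the version you wrote for HW5; if yours differs, use yours. The names `site_score` and `find_d_segments` are hypothetical, and positions are 0-based.
```python
# LLR scores (log base 2) for 0, 1, 2 and >=3 read starts, from the list above.
SCORES = (-0.1077, 0.4772, 1.0622, 1.6748)


def site_score(read_starts):
    """Score for one site; counts of 3 or more share the last score."""
    return SCORES[min(read_starts, 3)]


def find_d_segments(scores, s_threshold, d):
    """Return (start, end, score) for segments with score >= s_threshold.

    d is the (negative) dropoff: a candidate segment is closed once the
    running score falls to 0 or below, or to max + d or below.
    """
    segments = []
    cumul = 0.0   # running score since the current start
    best = 0.0    # highest running score seen since the current start
    start = end = 0
    last = len(scores) - 1
    for i, s in enumerate(scores):
        cumul += s
        if cumul >= best:
            best = cumul
            end = i
        if cumul <= 0 or cumul <= best + d or i == last:
            if best >= s_threshold:
                segments.append((start, end, best))
            cumul = best = 0.0
            start = end = i + 1
    return segments


if __name__ == "__main__":
    read_starts = [0, 3, 3, 3] + [0] * 20
    site_scores = [site_score(c) for c in read_starts]
    print(find_d_segments(site_scores, s_threshold=3, d=-1))
```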
- Generate two lists of pairs, one for the simulations and one for the real data. In each list, each row of paired data should contain:
  - S-value
  - N_seg(S), the number of D-segments found for this threshold of S (note that this is the number of segments with score at least S, not exactly S)
- Generate a list of ratios from the simulated data. Each row should contain:
  - N_seg(S) / N_seg(S + 1) (rounded to 2 decimal places)
  If the denominator of a ratio is 0 (that is, no D-segments were found at threshold S + 1), print -1. See the template file for other formatting details.
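  A small sketch of the ratio computation, assuming the segment counts are held in a dictionary that maps each threshold S to N_seg(S). The function name `ratio_rows` and the returned (S, text) layout are illustrative assumptions; the template file remains the authority on output format.
```python
def ratio_rows(n_seg):
    """Return (S, text) pairs for N_seg(S) / N_seg(S + 1).

    n_seg maps each threshold S to the number of D-segments with score >= S.
    A ratio is produced only when both S and S + 1 are present in n_seg.
    """
    rows = []
    for s in sorted(n_seg):
        if s + 1 not in n_seg:
            continue
        denominator = n_seg[s + 1]
        if denominator == 0:
            rows.append((s, "-1"))  # zero denominator: print -1
        else:
            rows.append((s, f"{n_seg[s] / denominator:.2f}"))
    return rows


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    for s, text in ratio_rows({10: 8, 11: 4, 12: 0}):
        print(s, text)
```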
- As discussed in lecture, Karlin-Altschul theory predicts that, for LLR scores using logarithmic base b, the number of D-segments with scores at least S should be proportional to b^-S (b to the power -S; this is the reciprocal of the corresponding LR). Since our scores used logarithmic base 2, if N_seg(S1) is the number of D-segments found with score threshold S1, and N_seg(S2) is the number of D-segments found with score threshold S2, then the ratio N_seg(S1)/N_seg(S2) should be approximately equal to 2^(S2 - S1).
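  The step from proportionality to the ratio can be written out as follows; the constant C is an assumed name for the unspecified proportionality constant, which cancels in the ratio.
```latex
N_{\mathrm{seg}}(S) \approx C \, b^{-S}
\quad\Longrightarrow\quad
\frac{N_{\mathrm{seg}}(S_1)}{N_{\mathrm{seg}}(S_2)}
  \approx \frac{C \, b^{-S_1}}{C \, b^{-S_2}}
  = b^{\,S_2 - S_1}
```
  With b = 2 and consecutive thresholds (S2 = S1 + 1), the predicted ratio is 2^1 = 2.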
- Consider the following questions:
  - Does this relationship appear to be true for the simulated data?
  - Is it true for the real data?
  - Would you expect it to be true for the real data?
  - What score threshold is a reasonable one to use for the real data, to ensure a very low false positive rate?
  Answer these questions after the tables, as shown in the template file.
You must turn in your results and your computer program. Please put everything into ONE plain text file; do not send an archive of files, a tar file, or a word processing document. Compress it (using either Unix compress or gzip; if you don't have access to either of these programs, let us know), and send it as an attachment to both Phil and Dani.