First, write a program that simulates a sequence of read start
counts over the same number of genomic positions as in the original
data. The distribution of read start counts (the number of
positions with 0, 1, 2, and >=3 read starts) should be, on
average, the same as in the original data. The following
pseudocode demonstrates one approach to creating this
sequence:
    N = number of sites in original sequence
    counts[r] = number of sites with r read starts in original sequence
    for each site 1..N
        x = random number between 0 and 1 (uniform distribution)
        if x < counts[0] / N
            randomized_counts[site] = 0
        else if x < (counts[0] + counts[1]) / N
            randomized_counts[site] = 1
        else if x < (counts[0] + counts[1] + counts[2]) / N
            randomized_counts[site] = 2
        else
            randomized_counts[site] = 3
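The pseudocode above could be turned into Python along the following lines. This is only a minimal sketch: the function names bin_counts and simulate_read_starts, the list-of-four layout for counts, and the optional rng argument are assumptions made for illustration, not part of the assignment.

```python
import random
from collections import Counter


def bin_counts(read_starts):
    """Tally sites with 0, 1, 2 and >=3 read starts.

    read_starts is a list with one integer per site. The last bin
    (index 3) collects every site with three or more read starts.
    """
    tally = Counter(min(r, 3) for r in read_starts)
    return [tally[k] for k in range(4)]


def simulate_read_starts(counts, rng=None):
    """Draw one value in {0, 1, 2, 3} per site, following the pseudocode.

    counts is [n0, n1, n2, n3plus] from the original data; the number
    of simulated sites is the sum of these four entries.
    """
    rng = rng or random.Random()
    n_sites = sum(counts)
    if n_sites == 0:
        return []

    # Cumulative thresholds, exactly as in the chain of "else if" tests.
    t0 = counts[0] / n_sites
    t1 = (counts[0] + counts[1]) / n_sites
    t2 = (counts[0] + counts[1] + counts[2]) / n_sites

    randomized = []
    for _ in range(n_sites):
        x = rng.random()  # uniform on [0, 1)
        if x < t0:
            randomized.append(0)
        elif x < t1:
            randomized.append(1)
        elif x < t2:
            randomized.append(2)
        else:
            randomized.append(3)  # stands for ">=3" read starts
    return randomized


if __name__ == "__main__":
    # Made-up example counts, purely to exercise the function.
    simulated = simulate_read_starts([700, 200, 70, 30], random.Random(1))
    print(bin_counts(simulated))
```

Because every site in the >=3 category is recorded as 3, the simulated sequence matches the binned distribution on average but does not preserve the exact total number of read starts.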
This randomization tends to eliminate the clustering of read starts
due to copy number variation. Note, however, that it still
preserves the distribution of read start counts. As a result, this
approach is expected to be more conservative than simply placing
read starts at random positions across the sequence. The
distribution is preserved to allow for the fact that factors other
than CNVs, such as library amplification, can also cause clustering
of read starts at a particular site.
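For comparison, the simpler alternative mentioned above (placing read starts at random positions) might look like the sketch below. The function name place_uniformly and its parameters are hypothetical, and n_reads is assumed to be the total number of read starts in the original data.

```python
import random


def place_uniformly(n_sites, n_reads, rng=None):
    """Drop n_reads read starts onto n_sites positions uniformly at random.

    Returns a list with the number of read starts at each site. Unlike
    the count-preserving randomization, nothing here forces the share
    of sites with 0, 1, 2 or >=3 read starts to match the original data.
    """
    rng = rng or random.Random()
    starts = [0] * n_sites
    for _ in range(n_reads):
        starts[rng.randrange(n_sites)] += 1
    return starts


if __name__ == "__main__":
    # Made-up sizes, purely to exercise the function.
    starts = place_uniformly(1000, 430, random.Random(1))
    print(sum(1 for s in starts if s >= 3), "sites with >=3 read starts")
```

If the original data contain more sites with several read starts than uniform placement would produce, for example because of library amplification, this null sequence will have fewer such sites than the count-preserving one.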