MBT 541 Homework Assignment 6

Due Friday June 7

Write a program to find CpG islands in this 1.8 Mb stretch of human genome sequence, as follows:
1. Using a scoring system that attaches a score of 17 to each C that is the first base in a CpG dinucleotide, and -1 to every other base, find all maximal segments in the sequence with scores at least 75.
2. Many of these segments fall within interspersed repeats and are not true unmethylated CpG islands (the original repeat was presumably rich in CpGs, but after insertion the sequence becomes methylated and the CpGs eventually mutate away, given enough time). To distinguish the true CpG islands from those in interspersed repeats, your program should parse the annotations of repeats in the above file, which look like this:
```
      repeat_region   complement(2529..2627)
                     /rpt_family="L1M4"
```
  For each segment you find in part 1, if at least 80% of its sequence lies within an annotated repeat, then you should attach the name of the repeat (in the above case, L1M4) to the segment. If the segment overlaps or contains an annotated repeat but more than 20% of the segment lies outside the repeat, then do not attach the repeat name to it.
3. Also parse the annotated mRNA transcripts in the file, which look like this:
```
     mRNA            join(156367..156697,157026..157213,162737..165293)
                     /gene="CAV2"
                     /product="caveolin 2"
```
  You should find that essentially every island found in part 1 that does not correspond to a repeat lies at or near the 5' end of one of these mRNAs. (You can do this manually if you like, by looking through the file above for the mRNA annotations.)
4. Print out a list of the segment locations and their scores, with attached repeat and mRNA names if any, in the order in which they appear on the chromosome, e.g.
```
 45317..45900    87 AluY
 155201..157003 261 CAV2
 183107..187009 103 L1M4
   .
   .
   .
```
5. Find a null score distribution for high-scoring segments, as follows: Let n be the total number of CpG dinucleotides in the 1.8 Mb sequence above, and let f be their frequency in the sequence, i.e. f = n / 1800000. Simulate 1000 vectors of length 1800000, each consisting entirely of the integers 17 and -1, by choosing at each position the value 17 with probability f and the value -1 with probability 1 - f. In each simulated vector, find all maximal segments of score 25 or higher. Construct a histogram listing, for each possible integer score from 25 to 100, the number of maximal segments you found (in all of the replicas) that had that score or greater. Divide this by 1000 to give the average number per replica. Print this out in the following format:
```
100  2  .002
 99  4  .004
 98  4  .004
  .
  .
  .
 25 13001 13.001
```
6. Based on the histogram, what is the probability that a segment with score 75 or higher will occur by chance in a sequence of length 1800000 having the same number of CpGs as the original?

Email the above (i.e. list of segments; histogram of scores in the simulated sequences; and the probability of a score of 75 or higher occurring by chance) to me at phg@u.washington.edu. Please make it as compact as possible. Do NOT send the code itself. Include the output in the body of your email message (as plain text), NOT as an attachment.
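
As a rough illustration of parts 1, 2 and 5, here is a minimal Python sketch. It is not a reference solution, and it rests on several assumptions that are not part of the assignment. All function names (`cpg_scores`, `find_segments`, `parse_repeats`, `repeat_label`, `null_counts`) are hypothetical. The assignment does not define "maximal segment" precisely, so the sketch uses a simple one-pass scan: a segment starts at a positive score, ends at its running maximum, and is closed once the cumulative score falls to zero or below. A stricter definition could split or merge segments differently. Coordinates are 1-based and inclusive, matching the annotation excerpts. The 80% rule is checked against one repeat at a time, and only the first `start..end` span of a `repeat_region` location is read.

```python
import random
import re

SPAN = re.compile(r"(\d+)\.\.(\d+)")


def cpg_scores(seq, hit=17, miss=-1):
    """Part 1 scoring: `hit` for a C followed by G, `miss` for every other base."""
    seq = seq.upper()
    return [hit if seq[i:i + 2] == "CG" else miss for i in range(len(seq))]


def find_segments(scores, threshold):
    """One-pass scan; returns (start, end, score), 1-based inclusive.

    Assumption: a segment starts at a positive score, ends at the running
    maximum, and is closed when the cumulative score drops to zero or below.
    """
    segments = []
    start = None
    total = best = end = 0
    for pos, s in enumerate(scores, start=1):
        if start is None:
            if s <= 0:
                continue          # skip negatives between segments
            start, total, best = pos, 0, 0
        total += s
        if total > best:
            best, end = total, pos
        elif total <= 0:
            if best >= threshold:
                segments.append((start, end, best))
            start = None
    if start is not None and best >= threshold:
        segments.append((start, end, best))
    return segments


def parse_repeats(lines):
    """Collect (start, end, family) from repeat_region features."""
    repeats, span = [], None
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "repeat_region":
            m = SPAN.search(line)  # first start..end span only
            span = (int(m.group(1)), int(m.group(2))) if m else None
        elif span and fields[0].startswith("/rpt_family="):
            family = fields[0].split("=", 1)[1].strip('"')
            repeats.append((span[0], span[1], family))
            span = None
    return repeats


def repeat_label(segment, repeats, min_percent=80):
    """Family of the first single repeat covering >= min_percent of the segment."""
    start, end, _ = segment
    length = end - start + 1
    for r_start, r_end, family in repeats:
        overlap = min(end, r_end) - max(start, r_start) + 1
        if overlap * 100 >= min_percent * length:  # integer comparison
            return family
    return None


def null_counts(length, f, replicas, rng, lo=25, hi=100):
    """Part 5: rows of (score, segments with that score or greater, average)."""
    peaks = []
    for _ in range(replicas):
        sim = [17 if rng.random() < f else -1 for _ in range(length)]
        peaks.extend(score for _, _, score in find_segments(sim, lo))
    rows = []
    for t in range(hi, lo - 1, -1):
        count = sum(1 for p in peaks if p >= t)
        rows.append((t, count, count / replicas))
    return rows
```

With these helpers, a call such as `null_counts(1800000, f, 1000, random.Random())` would produce the rows for the part 5 table, although pure Python lists will be slow at that size and an array-based version may be preferable. Passing a seeded `random.Random` instance makes a run repeatable.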