Genome 540 Homework Assignment 8

Due Sun March 16

Write a program (based on the 'D-segment' pseudocode given in lecture) to identify CpG islands as 'D-segments':
- The scoring scheme should be the same one as in the draft human genome paper:
  - each CpG (more precisely each C that is immediately followed by a G) gets a score of +17
  - any other nucleotide (i.e. any non-C, or any C that is not immediately followed by a G) gets a score of -1.
- Allow user settable parameter values
  - S (= score threshold)
  - D (= dropoff threshold)
- Your program should output
  - the name & first line of the input sequence file.
  - the dinucleotide composition of the input sequence (do not count dinucleotides with N's or other ambiguity characters), giving each of the following:
    - counts (# of times each dinuc occurs)
    - frequencies (counts / total no. of dinucs)
    - conditional probability of a nucleotide occurring, given the preceding nucleotide
  - (optionally) a list of the segments (starting & ending positions, and scores),
  - a score histogram (number of segments at each score value, for each score that actually occurs).
Run the program on this chromosome 21 sequence file, outputting both the list of segments and the score histogram, using parameter values
- S = 100.
- D = -40.
Derive the background distribution (frequencies associated to scores +17 and -1) from the chromosome 21 dinucleotide composition, and use it to estimate the scaling factor lambda (λ) needed to rescale the scoring scheme to (base e) LLR scores. This will require writing a program that computes the value of the function f(λ) defined in class (Lecture 17) and, by doing a search, finds the unique positive value of λ (to an accuracy of three decimal places) for which f(λ) = 1.
Write a program that uses the conditional probabilities derived for chromosome 21 to simulate a sequence having ten times the length, and the same dinucleotide composition as chromosome 21, and output this as a fasta file.
Run your program from #1 above on the simulated sequence, outputting everything except the list of segments, using
- S = 40.
- D = -40.
Compare the number of segments found in the simulated sequence at particular scores, to the number predicted by the Karlin-Altschul theory, as follows:
- for each integer s between 40 and 100 (inclusive), find the number of segments having score s or higher. Call this h(s).
- print out a table listing for each s, the values of h(s), h(s) / h(40), and the predicted value of h(s) / h(40) from the Karlin-Altschul formula (note that you don't need to know K!)
Is the score threshold you used for CpG islands on chromosome 21 a reasonable one? Why or why not?
You must turn in your results and your computer program, using this template file . Please put everything into ONE plain text file - do not send an archive of files or a tar file, or a word processing document file. Compress it (using either Unix compress, or gzip -- if you don't have access to either of these programs let us know), and send it as an attachment to both Phil and Will. (The XML file includes a DTD, which specifies the XML file format. Place the DTD at the beginning of your XML document. When you are done with your XML check to make sure it conforms to the DTD using this website: http://www.stg.brown.edu/service/xmlvalid and resolve any errors before turning it in.)