Genome 540 Homework Assignment 8

Due Sun March 16

  1. Write a program (based on the 'D-segment' pseudocode given in lecture) to identify CpG islands as 'D-segments':
  2. Run the program on this chromosome 21 sequence file, outputting both the list of segments and the score histogram, using parameter values
  3. Derive the background distribution (frequencies associated to scores +17 and -1) from the chromosome 21 dinucleotide composition, and use it to estimate the scaling factor lambda (λ) needed to rescale the scoring scheme to (base e) LLR scores. This will require writing a program that computes the value of the function f(λ) defined in class (Lecture 17) and, by doing a search, finds the unique positive value of λ (to an accuracy of three decimal places) for which f(λ) = 1.
  4. Write a program that uses the conditional probabilities derived for chromosome 21 to simulate a sequence having ten times the length, and the same dinucleotide composition as chromosome 21, and output this as a fasta file.
  5. Run your program from #1 above on the simulated sequence, outputting everything except the list of segments, using
  6. Compare the number of segments found in the simulated sequence at particular scores, to the number predicted by the Karlin-Altschul theory, as follows:
  7. Is the score threshold you used for CpG islands on chromosome 21 a reasonable one? Why or why not?
  8. You must turn in your results and your computer program, using this template file . Please put everything into ONE plain text file - do not send an archive of files or a tar file, or a word processing document file. Compress it (using either Unix compress, or gzip -- if you don't have access to either of these programs let us know), and send it as an attachment to both Phil and Will. (The XML file includes a DTD, which specifies the XML file format. Place the DTD at the beginning of your XML document. When you are done with your XML check to make sure it conforms to the DTD using this website: http://www.stg.brown.edu/service/xmlvalid and resolve any errors before turning it in.)