Genome 540 Homework Assignment 4
Due Sunday Feb. 6
- Write a program that does the following, for the DNA sequence of Dictyostelium discoideum AX4 chromosome 1:
  1. Read the GenBank file (with suffix .gbk), and from the CDS entries in the FEATURES table infer the locations of the starts of the coding sequences on both strands. Ignore CDS coordinates containing a '<' or '>' (these symbols indicate that the precise start or end of the coding sequence is uncertain).
  2. Use the information from step 1 to compute count and frequency matrices (of the type presented in lecture 8 for C. elegans splice sites) for the translation start sites. These should extend from position -10 (i.e. 10 bases upstream of the first base of the start codon) to position +10 (i.e. 10 bases downstream of that base) -- 21 bases in all. To generate these you will need to read in the genome sequence (which appears later in the GenBank file), and to reverse-complement it in order to handle genes on the opposite strand correctly. Ns in the sequence (unknown bases) should be ignored when computing these matrices.
  3. Compute a site weight matrix using the frequency table for the translation start sites, together with the genome nucleotide frequencies. Each entry in the weight matrix should be the log, to the base 2, of the ratio of the frequency of that nucleotide at that site position to its frequency in the genome. Use -99.0 as the weight for any cell that has frequency 0 in the translation start sites.
  4. Using the weight matrix from step 3, generate two score histograms (using a bin size of 1 for the scores):
     - a histogram of the scores of all "true" translation start sites (i.e. the ones used to construct the site frequency table)
     - a histogram of the scores of all positions in the actual genome sequence (and its complement). When calculating scores, sequence positions that are N should be given a score of 0.
  5. Generate a list of all positions in the genome and its complement that have scores >= 5.0 but which do NOT correspond to annotated translation start sites.
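As a rough illustration of the first two tasks above (finding CDS starts and tabulating the 21-base windows), here is a minimal Python sketch. It is not the required implementation. The function names (`cds_start`, `start_window`, `count_matrix`, `frequency_matrix`) are hypothetical, and it assumes the location string has already been pulled out of the CDS line and the genome has already been read into one uppercase string. It also assumes every segment of a `join(...)` lies on the same strand, and it skips windows that run off either end of the sequence.

```python
import re
from collections import Counter

# Translation table for complementing a DNA string (N stays N).
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def cds_start(location):
    """Return (1-based start position, strand) for a CDS location string.

    Returns None when the location contains '<' or '>' (uncertain ends).
    Assumption: all joined segments lie on one strand.
    """
    if "<" in location or ">" in location:
        return None
    coords = [int(x) for x in re.findall(r"\d+", location)]
    if not coords:
        return None
    if "complement" in location:
        # On the reverse strand the coding sequence starts at the largest coordinate.
        return max(coords), "-"
    return min(coords), "+"


def start_window(genome, pos, strand, flank=10):
    """Return the window from -flank to +flank around the start base, read 5'->3'.

    Returns None if the window would extend past either end of the genome.
    """
    i = pos - 1  # convert to a 0-based index
    lo, hi = i - flank, i + flank + 1
    if lo < 0 or hi > len(genome):
        return None
    window = genome[lo:hi]
    if strand == "-":
        # Reverse-complement so that index `flank` is still the start base.
        window = window.translate(COMPLEMENT)[::-1]
    return window


def count_matrix(windows, width=21):
    """Count A/C/G/T at each window position, ignoring N and other symbols."""
    counts = [Counter() for _ in range(width)]
    for w in windows:
        for k, base in enumerate(w):
            if base in "ACGT":
                counts[k][base] += 1
    return counts


def frequency_matrix(counts):
    """Convert per-position counts to frequencies over the counted bases."""
    freqs = []
    for col in counts:
        total = sum(col.values())
        freqs.append({b: (col[b] / total if total else 0.0) for b in "ACGT"})
    return freqs
```

Because the reverse-strand window is reverse-complemented before counting, index 10 of every window is the first base of the start codon regardless of strand.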
Your output should conform to the template specified below.
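For the weight-matrix, scoring and histogram tasks in the list above, the arithmetic could look like the sketch below. The names (`weight_matrix`, `score`, `histogram`) are hypothetical. Two points are assumptions rather than part of the assignment text: an N in a window is taken to contribute 0 to that window's score, and a score is binned by taking its floor. Check both against the output template before relying on them.

```python
import math
from collections import Counter


def weight_matrix(freqs, background, zero_weight=-99.0):
    """Build log2(site frequency / background frequency) weights.

    `freqs` is a list of {base: frequency} dicts, one per site position;
    `background` is a {base: frequency} dict of genome-wide frequencies.
    Cells whose site frequency is 0 get `zero_weight`.
    """
    weights = []
    for col in freqs:
        weights.append({
            b: (math.log2(col[b] / background[b]) if col[b] > 0 else zero_weight)
            for b in "ACGT"
        })
    return weights


def score(window, weights):
    """Sum the weights along a window; N (or any non-ACGT symbol) adds 0."""
    return sum(weights[k].get(base, 0.0) for k, base in enumerate(window))


def histogram(scores):
    """Bin scores with bin size 1, assuming each bin is labeled by floor(score)."""
    return Counter(math.floor(s) for s in scores)
```

For example, with a uniform background of 0.25 per base, a site position where A has frequency 0.5 gets weight log2(0.5 / 0.25) = 1 for A.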
- You must turn in your results and your computer program, using this template file.
Please put everything into ONE plain text file - do not send an archive
of files or a tar file, or a word processing document file. Compress it (using either Unix compress, or gzip -- if
you don't have access to either of these programs let us know), and
send it as an attachment to both Phil (phg (at) u.washington.edu) and
Rupali (rpatward (at) u.washington.edu).
The XML file includes a DTD, which specifies the XML file format. Place the DTD at the beginning of your XML document. When you are done with your XML, check that it conforms to the DTD using this website, and resolve any errors before turning it in. Warnings are acceptable.