Write a program that does the following, for the same genome sequence you used in assignments 1-4:
1) Reads the GenBank file (with suffix .gbk) and, from the FEATURES entries, infers the locations of the starts of the coding sequences on both strands.
2) Uses the information from step 1 to compute count and frequency matrices (of the type presented in lectures 11-12 for C. elegans splice sites) for the translation start sites. These should extend from position -10 (i.e. 10 bases upstream of the first base of the start codon) to position +10 (i.e. 10 bases downstream of that base). To generate these you will need to read in the genome sequence (which appears later in the GenBank file), and to complement it in order to handle genes on the opposite strand correctly.
3) Computes a site weight matrix using the frequency table for the translation start sites, together with the genome nucleotide frequencies you found in HW1. Each entry in the weight matrix should be the log, to the base 2, of the ratio of the site frequency to the corresponding genome nucleotide frequency. Use -99.0 as the weight for any cell that has frequency 0 in the translation start sites.
4) Simulates a random sequence that has the same nucleotide frequencies, and the same length, as the original genome sequence. It is OK to assume independence for this (i.e. each successive nucleotide in the sequence can be chosen independently of the preceding nucleotides). Compute the nucleotide counts for the simulated sequence to verify that the frequencies are what you expected.
5) Generates, using the weight matrix from step 3, three score histograms (using a bin size of 1 for the scores):
   a) a histogram of the scores of all "true" translation start sites (i.e. the ones used to construct the site frequency table)
   b) a histogram of the scores of all positions in the actual genome sequence (and its complement)
   c) a histogram of the scores of all positions in the simulated genome sequence (and its complement)
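The matrix steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not a reference solution: the function names are hypothetical, start positions are assumed to be 0-based forward-strand indices of the first base of the start codon, strands are encoded as +1 or -1, "complement" is taken to mean the reverse complement, and sites too close to the sequence ends are simply skipped.

```python
import math

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def site_window(genome, start, strand, flank=10):
    """Return the 2*flank+1 bases around a start site, read in the gene's
    orientation, or None if the window runs off either end (assumption).
    `start` is the 0-based forward-strand index of the first base of the
    start codon; `strand` is +1 or -1 (hypothetical encoding)."""
    lo, hi = start - flank, start + flank + 1
    if lo < 0 or hi > len(genome):
        return None
    window = genome[lo:hi]
    return window if strand == 1 else reverse_complement(window)

def count_matrix(windows):
    """Per-position counts of A, C, G, T over equal-length windows."""
    width = len(windows[0])
    counts = {b: [0] * width for b in BASES}
    for w in windows:
        for i, base in enumerate(w):
            if base in counts:  # ignore ambiguity codes such as N
                counts[base][i] += 1
    return counts

def frequency_matrix(counts):
    """Convert counts to per-position frequencies."""
    width = len(counts["A"])
    totals = [sum(counts[b][i] for b in BASES) for i in range(width)]
    return {b: [counts[b][i] / totals[i] if totals[i] else 0.0
                for i in range(width)] for b in BASES}

def weight_matrix(freqs, background, zero_weight=-99.0):
    """log2(site frequency / background frequency); zero cells get -99.0."""
    return {b: [math.log2(f / background[b]) if f > 0 else zero_weight
                for f in freqs[b]] for b in BASES}

def score(window, weights):
    """Sum of the weights of the bases observed at each window position."""
    return sum(weights[base][i] for i, base in enumerate(window)
               if base in weights)
```

With flank=10 each window has 21 columns, corresponding to positions -10 through +10. Scoring every position of a sequence then amounts to calling score on each 21-base window of the sequence and of its reverse complement.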
Your output should provide
the name and first line of the GenBank file
the matrix of nucleotide counts at each position from -10 to +10, and the corresponding matrix of frequencies.
the weight matrix (give values to three decimal places)
the nucleotide counts for the simulated sequence
the three histograms. Present these in the following form:
where each row gives the score value x, followed by the number of times a score >= x but < x+1 was observed.
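As a rough illustration of the simulation and binning steps, here is a minimal sketch. The function names are hypothetical, and the row layout (score value, a space, then the count) is an assumption based only on the description above, since the exact layout is not reproduced here. Taking the floor of each score assigns it to the bin x with x <= score < x+1, which also behaves correctly for negative scores.

```python
import math
import random
from collections import Counter

def simulate_sequence(length, freqs, seed=None):
    """Draw `length` independent bases with the given nucleotide
    frequencies, e.g. {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}."""
    rng = random.Random(seed)
    bases = list(freqs)
    weights = [freqs[b] for b in bases]
    return "".join(rng.choices(bases, weights=weights, k=length))

def score_histogram(scores):
    """Count scores in bins of width 1: bin x holds x <= score < x+1."""
    return Counter(math.floor(s) for s in scores)

def format_histogram(hist):
    """One row per non-empty bin: score value, then count (assumed layout)."""
    return "\n".join(f"{x} {hist[x]}" for x in sorted(hist))
```

Calling Counter(simulate_sequence(...)) on the simulated sequence gives the nucleotide counts needed to check that the simulated frequencies match the genome frequencies.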
Email the output above to me and Chris. Please make it as compact
as possible. Do NOT send the code itself. Include the output in the
body of your email message (as plain text), NOT as an attachment.