Genome 540 Homework Assignment 5

Due Sunday Feb 13

Write a program that does the following, for the genome sequence of Pseudomonas aeruginosa (strain PAO1):
1. Reads the Genbank file (with suffix .gbk), and from the FEATURES entries infers the locations of the starts of the coding sequences on both strands.
2. Uses the information from step 1 to compute count and frequency matrices (of the type presented in lecture 10 for C. elegans splice sites) for the translation start sites. These should extend from position -10 (i.e. 10 bases upstream of the first base of the start codon) to position +10 (i.e. 10 bases downstream of that base) -- 21 bases in all. To generate these you will need to read in the genome sequence (which appears later in the Genbank file), and to reverse-complement it in order to handle genes on the opposite strand correctly.
3. Computes a site weight matrix using the frequency table for the translation start sites, together with the genome nucleotide frequencies. Each entry in the weight matrix should be the log, to the base 2, of the ratio of the site frequency to the corresponding genome (background) frequency. Use -99.0 as the weight for any cells that have frequency 0 in the translation start sites.
4. Using the weight matrix from step 3, generates two score histograms (using a bin size of 1 for the scores):
  1. a histogram of the scores of all "true" translation start sites (i.e. the ones used to construct the site frequency table)
  2. a histogram of the scores of all positions in the actual genome sequence (and its complement)
5. Generates a list of all positions in the genome and its complement that have scores >= 5.0 but which do NOT correspond to annotated translation start sites.
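As a rough illustration of steps 2 and 3, the sketch below extracts the 21-base window around a start site on either strand and turns a set of such windows into count, frequency, and weight matrices. It is a minimal sketch under stated assumptions, not a reference solution. The function names (site_window, build_matrices, score) and the toy sequence are made up for the example. It assumes that start coordinates are 1-based top-strand positions, that a window running off the end of the sequence is simply skipped, and that an ambiguity code contributes 0 to a score.

```python
import math

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")
BASES = "ACGT"
FLANK = 10  # window runs from position -10 to position +10


def reverse_complement(seq):
    """Complement each base, then reverse, so the result reads 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


def site_window(genome, start, strand):
    """Return the 21-base window centred on a start site, read 5'->3'.

    start is the 1-based top-strand coordinate of the first base of the
    start codon; strand is 0 (top) or 1 (bottom). Returns None when the
    window would run off either end (an assumption made for this sketch).
    """
    lo = start - 1 - FLANK       # 0-based, inclusive
    hi = start + FLANK           # 0-based, exclusive
    if lo < 0 or hi > len(genome):
        return None
    window = genome[lo:hi].upper()
    return window if strand == 0 else reverse_complement(window)


def build_matrices(windows, background):
    """Build per-position count, frequency, and log2 weight matrices."""
    width = 2 * FLANK + 1
    counts = [{b: 0 for b in BASES} for _ in range(width)]
    for w in windows:
        for i, base in enumerate(w):
            if base in counts[i]:          # ignore ambiguity codes
                counts[i][base] += 1
    freqs, weights = [], []
    for col in counts:
        total = sum(col.values())
        f = {b: (col[b] / total if total else 0.0) for b in BASES}
        freqs.append(f)
        weights.append({
            b: math.log2(f[b] / background[b]) if f[b] > 0 else -99.0
            for b in BASES
        })
    return counts, freqs, weights


def score(window, weights):
    """Sum the weights along a window; unknown bases add 0 (assumption)."""
    return sum(weights[i].get(base, 0.0) for i, base in enumerate(window))


# Toy example: made-up sequence, uniform background frequencies.
genome = "C" * 10 + "ATG" + "T" * 8
top = site_window(genome, 11, 0)
bottom = site_window(genome, 11, 1)
background = {b: 0.25 for b in BASES}
counts, freqs, weights = build_matrices([top, top], background)
print(top, bottom, score(top, weights))
```

Because the window is symmetric about the start position, the same top-strand slice serves both strands; only the bottom-strand case needs the reverse complement.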
Your output should conform to the template specified below.
You must turn in your results and your computer program, using the template (file format) described below. Please put everything into ONE plain text file - do not send an archive of files or a tar file, or a word processing document file. Compress it (using either Unix compress, or gzip -- if you don't have access to either of these programs let us know), and send it as an attachment to both Phil at phg@u.washington.edu, and Tobias at mann@gs.washington.edu.

Here is the template:

<gs540_hw assignment='4' name='student name' email='student email'>
<results>
<result type='first line' file='filename'>
first line of the .gbk file
</result>
<result type='nucleotide histogram' file='filename'>
This should give, for each base or 'ambiguity code' occurring in the sequence, the letter denoting the base, followed by an equals sign, followed by an integer giving the number of times the base occurs in the sequence and its complement. Put a comma between the different bases. E.g.
A=50,C=50,G=50,T=50,N=2
</result>
<result type='background frequency' file='filename'>
Like nucleotide histogram, but giving fraction of times (to 4 decimal places) each nucleotide occurs in the sequence and its complement. In computing these, ignore ambiguity-coded nucleotides. E.g. for the counts given as above one would get
A=.2500,C=.2500,G=.2500,T=.2500
</result>
<result type='count matrix' file='filename'>
Put the matrix of nucleotide counts at each position in known translation start sites here, as a list (pos,nuc)=count,... For example
(-10,A)=13,(-10,C)=103,(-10,G)=105,(-10,T)=15,(-9,A)=27, ...
where the interpretation is that nucleotide A occurs 13 times at position -10 in known translation start sites, etc. Ignore occurrences of ambiguity-coded nucleotides at each position.
</result>
<result type='frequency matrix' file='filename'>
Like count matrix, but indicating the fraction of times (to 4 decimal places) each nucleotide occurs at each position, rather than the total counts: e.g.
(-10,A)=.0551,(-10,C)=.4364, ...
</result>
<result type='weight matrix' file='filename'>
Like frequency matrix, but giving weight. Give values to three decimal places: e.g.
(-10,A)=-4.184, ...
</result>
<result type='score histogram' positions='true sites' file='filename'>
This should be a list of the form (i,n) where i is an integer and n gives the number of times a score >= i and < i+1 occurred, for the true start sites. Omit cases i if no score in that range was observed. Also omit all i's corresponding to scores < -50; but include an entry (<-50,n) indicating the number of times a score less than -50 occurred. E.g.
(<-50,403),(-50,35),(-49,17),...
</result>
<result type='score histogram' positions='all' file='filename'>
As above, but for all positions in the genome (and its complement).
</result>
<result type='position list' file='filename'>
A list of positions in the genome where scores >= 5.0 occurred but which do NOT correspond to an annotated translation start site. These should be given in the form (p,strand,score) where p indicates position (in top strand, origin 1 co-ordinates), strand = 0 (for top) or 1 (for bottom), and score is given to 3 decimal places. E.g.
(15774,0,5.310),(16007,1,7.632),...
</result>
</results>
<program>
<comments>
put comments about your code here
</comments>
<file>
file contents here
</file>
</program>
</gs540_hw>
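
The number and histogram formats requested in the template could be produced along the following lines. This is a hedged sketch rather than a required implementation. The helper names (score_histogram, fmt_fraction, fmt_position) are hypothetical. Dropping the leading zero from fractions is an assumption based on the examples in the template, as is always emitting the (<-50,n) entry even when n is 0.

```python
import math
from collections import Counter


def score_histogram(scores):
    """Bin scores so that bin i counts scores with i <= score < i + 1.

    Scores below -50 are pooled into a single "<-50" entry, which is
    always written first (an assumption; it could be omitted when 0).
    """
    bins = Counter()
    below = 0
    for s in scores:
        if s < -50:
            below += 1
        else:
            bins[math.floor(s)] += 1
    parts = [f"(<-50,{below})"]
    parts.extend(f"({i},{bins[i]})" for i in sorted(bins))
    return ",".join(parts)


def fmt_fraction(x):
    """Format a fraction to 4 decimal places without a leading zero."""
    text = f"{x:.4f}"
    return text[1:] if text.startswith("0.") else text


def fmt_position(p, strand, score):
    """Format one position-list entry with the score to 3 decimal places."""
    return f"({p},{strand},{score:.3f})"


# Made-up scores, only to show the output shape.
print(score_histogram([-99.0, -60.5, -50.0, -49.2, 4.9, 5.0, 5.7]))
print(fmt_fraction(0.25), fmt_position(15774, 0, 5.31))
```

Using math.floor rather than int() matters for negative scores: int(-49.2) truncates toward zero and would place the score in bin -49, whereas the definition in the template puts it in bin -50.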