Genome 540 Homework Assignment 2
Due Sunday Jan. 25
For the same bacterium you used in assignment 1:
- Write a program that
  - reads in the FASTA file of coding sequences for the organism (this file has the suffix .ffn, and you should be able to find it in the same place that you found the .fna file having the complete genome sequence);
  - computes a "codon usage table" that counts the total number of times each codon is used (in-frame) in all of the coding sequences (ignoring codons that contain ambiguous nucleotides: N, R, Y, etc.); and
  - prints out
    - the name of the FASTA file
    - the first line of the FASTA file
    - the number of sequences, and total number of codons, in the FASTA file
    - the codon usage table, in the same format that is used for the genetic code table (e.g. the upper left hand corner gives the number of times that "TTT" is used as a codon).
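As an illustration only, the codon counting and table layout for the first program might be sketched in Python as follows. The function names (`read_fasta`, `count_codons`, `format_table`, `report`) are hypothetical, and several details are assumptions rather than requirements of the assignment: the table uses the T, C, A, G ordering of the usual genetic code table, the "total number of codons" is taken to mean only the codons that were actually counted, a trailing partial codon is dropped, and the file is assumed to begin with a header line.

```python
from collections import Counter

BASES = "TCAG"  # assumed row/column order of the genetic code table


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line, []
            elif line:
                chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def count_codons(sequences):
    """Count in-frame codons, skipping any codon with a non-ACGT character."""
    counts = Counter()
    for seq in sequences:
        # Step through the sequence three bases at a time; a trailing
        # partial codon (1 or 2 bases) is never reached.
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if set(codon) <= set("ACGT"):
                counts[codon] += 1
    return counts


def format_table(counts):
    """First base picks the block of rows, second base the column,
    third base the row within the block, so TTT is at the upper left."""
    lines = []
    for first in BASES:
        for third in BASES:
            cells = []
            for second in BASES:
                codon = first + second + third
                cells.append(f"{codon} {counts[codon]:>7}")
            lines.append("  ".join(cells))
        lines.append("")  # blank line between blocks
    return "\n".join(lines).rstrip()


def report(path):
    """Print the four items requested for the first program."""
    records = list(read_fasta(path))
    counts = count_codons(seq for _, seq in records)
    print(path)
    print(records[0][0] if records else "")
    print(len(records), "sequences,", sum(counts.values()), "codons")
    print(format_table(counts))
```

For example, `count_codons(["ATGTTTTTTNAATAA"])` counts ATG once, TTT twice, and TAA once, and skips the ambiguous codon NAA.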
- Write a program to find a list of non-overlapping predicted coding sequences in your organism. Specifically, this should
  - identify all ORFs (open reading frames) that start with the codon ATG, GTG, CTG, or TTG, end with a stop codon in the same reading frame (with no intervening in-frame stop codons), and have length at least 100 codons; be sure to do this for both strands of the genome sequence (i.e. the strand given in the .fna file, and its reverse complement); codons that contain ambiguous nucleotides should be counted when determining the length, but should not be eligible to be stop or start codons;
  - prune this list of ORFs to eliminate overlapping sequences (removing the shorter one of an overlapping pair), by using the following procedure: first sort the list of ORFs by decreasing size; then work through the list one at a time (starting with the largest ORF), accepting an ORF if it does not overlap any previously accepted ORF, otherwise rejecting it (be sure to reject ORFs that overlap longer ORFs on the opposite strand as well as the same strand);
  - write a FASTA file containing all the accepted ORFs.
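The ORF search and the pruning procedure described above could be sketched roughly as below. This is one possible reading, not a reference solution. The names (`find_orfs`, `prune_overlaps`, `write_fasta`) and the FASTA header format are hypothetical. The sketch assumes an uppercase genome string, the stop codons TAA, TAG, and TGA, and that the stop codon is included in the ORF and in its length. It also reports every eligible start codon upstream of a stop as a separate ORF; keeping only the longest ORF per stop is another plausible interpretation.

```python
STARTS = {"ATG", "GTG", "CTG", "TTG"}
STOPS = {"TAA", "TAG", "TGA"}  # assumed stop codons
# Complement table covering the IUPAC ambiguity codes as well as A, C, G, T.
COMPLEMENT = str.maketrans("ACGTRYKMBDHVSWN", "TGCAYRMKVHDBSWN")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(genome, min_codons=100):
    """Return (start, end, strand, sequence) tuples. start and end are
    0-based, end-exclusive coordinates on the forward strand, so ORFs
    from both strands can be compared directly."""
    orfs = []
    n = len(genome)
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            starts = []  # start codons seen since the last in-frame stop
            for i in range(frame, n - 2, 3):
                codon = seq[i:i + 3]
                # A codon with an ambiguous base is in neither set, so it
                # adds to the length but cannot start or stop an ORF.
                if codon in STOPS:
                    end = i + 3
                    for s in starts:
                        if (end - s) // 3 >= min_codons:
                            if strand == "+":
                                lo, hi = s, end
                            else:
                                lo, hi = n - end, n - s
                            orfs.append((lo, hi, strand, seq[s:end]))
                    starts = []
                elif codon in STARTS:
                    starts.append(i)
    return orfs


def prune_overlaps(orfs):
    """Greedy pruning: longest first, accept an ORF only if it overlaps
    no previously accepted ORF on either strand."""
    accepted = []
    for orf in sorted(orfs, key=lambda o: o[1] - o[0], reverse=True):
        lo, hi = orf[0], orf[1]
        if all(hi <= a_lo or lo >= a_hi for a_lo, a_hi, *_ in accepted):
            accepted.append(orf)
    return accepted


def write_fasta(orfs, path, width=60):
    """Write the accepted ORFs; the header format here is made up."""
    with open(path, "w") as out:
        for k, (lo, hi, strand, seq) in enumerate(orfs, 1):
            out.write(f">orf{k} {lo + 1}..{hi} ({strand})\n")
            for j in range(0, len(seq), width):
                out.write(seq[j:j + width] + "\n")
```

The overlap check compares each candidate against every accepted ORF, which is simple but quadratic; for a bacterial genome an interval structure or a per-base "covered" array would be faster. Ties in length are broken by the order in which the ORFs were found, since Python's sort is stable.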
- Run the program you wrote in #1 above on the FASTA file you generated in #2, so as to print out a codon usage table for the open reading frames you found.
- Email the output from #1 and #3 (but not #2) to me at phg@u.washington.edu AND Chris at ctsa@u.washington.edu. Please make it as compact as possible. Do NOT send the code itself. Include the output in the body of your email message (as plain text), NOT as an attachment.