Genome 540 Homework Assignment 2
Due Sunday Jan. 23
- Write a program that
  - reads in the FASTA file of coding sequences for an organism (on the NCBI web site this file has the suffix .ffn, and you should be able to find it in the same place that you found the .fna file having the complete genome sequence);
  - computes a "codon usage table" that counts the total number of times each codon is used (in-frame) in all of the coding sequences (ignoring codons that contain ambiguous nucleotides -- N, R, Y, etc.); and
  - prints out (in the format described below)
    - the name of the FASTA file
    - the first line of the FASTA file
    - the number of sequences, and total number of codons, in the FASTA file
    - the number of times each codon is used
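The codon counting step described above can be sketched as follows. This is only a minimal illustration, not a reference solution: the function names read_fasta and count_codons are hypothetical, and the sketch assumes that every record in the .ffn file starts in frame at its first base and that any codon containing a letter other than A, C, G or T is "ambiguous".

```python
from collections import Counter

VALID_BASES = set("ACGT")


def read_fasta(path):
    """Yield (header, sequence) pairs from a multi-record FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new record begins: emit the previous one, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line, []
            elif line:
                chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def count_codons(sequences):
    """Count in-frame codons, skipping codons with non-ACGT letters."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # Step through the sequence three bases at a time; a trailing
        # partial codon (1 or 2 bases) is ignored.
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if set(codon) <= VALID_BASES:
                counts[codon] += 1
    return counts
```

With these helpers, count_codons(seq for _, seq in read_fasta(path)) gives the table, and sum(counts.values()) gives the number of codons that were counted. Whether skipped ambiguous codons should also be included in the reported total is not stated above, so treat that choice as an assumption.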
- Write a program to find a list of non-overlapping predicted coding sequences in the organism. Specifically, this should
  - identify all ORFs (open reading frames) that start with the codon ATG, GTG, CTG, or TTG, end with a stop codon in the same reading frame (with no intervening in-frame stop codons), and have length at least 100 codons; be sure to do this for both strands of the genome sequence (i.e. the strand given in the .fna file, and its reverse complement); codons that contain ambiguous nucleotides should be counted when determining the length, but should not be eligible to be stop or start codons;
  - prune this list of ORFs to eliminate overlapping sequences (removing the shorter one of an overlapping pair), by using the following procedure: first sort the list of ORFs by decreasing size (ORFs of the same size should be ordered by where they appear in the sequence, in forward-strand coordinates -- with ORFs nearest the beginning of the sequence appearing first); then work through the list one at a time (starting with the largest ORF), accepting an ORF if it does not overlap any previously accepted ORF, otherwise rejecting it (be sure to reject ORFs that overlap longer ORFs on the opposite strand as well as the same strand);
  - write out (in the format described below) the translation start and stop positions, in forward-strand co-ordinates, and the strand, of each accepted ORF;
  - write a FASTA file containing all the accepted ORFs.
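The ORF search and the greedy pruning procedure described above might look roughly like the sketch below. It rests on several assumptions that the text does not settle, so check them against the test example: the stop codons are taken to be TAA, TAG and TGA; the length of an ORF is counted in codons excluding the stop codon; only the longest ORF for each stop codon (the one beginning at the most upstream start codon) is kept as a candidate; and the genome is treated as linear. All function names are hypothetical. ORFs are stored as (lo, hi, strand) in 1-based forward-strand coordinates with lo <= hi, so for a "-" strand ORF the translation start is hi and the last base of the stop codon is lo.

```python
STARTS = {"ATG", "GTG", "CTG", "TTG"}
STOPS = {"TAA", "TAG", "TGA"}  # assumed; not listed in the assignment
COMPLEMENT = str.maketrans("ACGTRYKMBDHVSWN", "TGCAYRMKVHDBSWN")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case IUPAC sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def scan_strand(seq, min_codons):
    """Yield 0-based half-open (begin, end) ORF spans on one strand."""
    for frame in range(3):
        start = None  # most upstream start codon since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            # Ambiguous codons match neither set, so they extend the
            # ORF but can never start or stop it.
            if codon in STOPS:
                if start is not None and (i - start) // 3 >= min_codons:
                    yield start, i + 3  # span includes the stop codon
                start = None
            elif start is None and codon in STARTS:
                start = i


def find_orfs(genome, min_codons=100):
    """Return (lo, hi, strand) for ORFs on both strands."""
    genome = genome.upper()
    n = len(genome)
    orfs = [(b + 1, e, "+") for b, e in scan_strand(genome, min_codons)]
    # Map reverse-strand spans back to forward-strand coordinates.
    orfs += [(n - e + 1, n - b, "-")
             for b, e in scan_strand(reverse_complement(genome), min_codons)]
    return orfs


def prune(orfs):
    """Greedily accept ORFs, largest first, rejecting any overlap."""
    ordered = sorted(orfs, key=lambda o: (-(o[1] - o[0] + 1), o[0]))
    accepted = []
    for lo, hi, strand in ordered:
        # Strand is ignored here, so opposite-strand overlaps also reject.
        if all(hi < a_lo or lo > a_hi for a_lo, a_hi, _ in accepted):
            accepted.append((lo, hi, strand))
    return accepted
```

The overlap check compares each candidate against every accepted ORF, which is simple but quadratic. Because accepted ORFs never overlap one another, keeping them sorted by position and checking only the two neighbours of a candidate would be a faster alternative for a full genome.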
- Run the codon-counting program from the first task above on the FASTA file of accepted ORFs that you generated in the second task, so as to generate codon counts for the ORFs you found.
- For the homework assignment, you should run your programs on Salmonella typhimurium (strain LT2) (one of the organisms used in the first assignment). To test whether your program is working correctly, run it first on the test example Pseudomonas aeruginosa (strain PAO1) to see whether you get the right answer.
- You must turn in your results and your computer
program, using the template (file format) described below. Please put everything into ONE file - do not send an archive
of files or a tar file.
After creating a plain text file (NOT a word processing document file)
in this format, compress it (using either Unix compress, or gzip -- if
you don't have access to either of these programs let us know), and
send it as an attachment to both Phil at phg@u.washington.edu, and
Tobias at mann@gs.washington.edu.
Here is the template:
first line of the .ffn file that you use to compute the
codon table
Put the codon table you computed from the .ffn file for
the analyzed genome here.
Codon histograms should consist of a base triplet,
followed by an equals sign, followed by an integer.
Put a comma between the different codons. For instance,
a codon table might start like this:
UUU=50,UCU=50,UAU=50,UGU=50,
Codons can be reported in any order.
Put the codon table for the ORFs you found in part 2 of the
assignment here. Report the name of the FASTA file you
used to find the non-overlapping ORFs.
Each ORF should be reported on a single line as above; they should appear
in the order found by your program, according to the algorithm described
above (i.e. largest first).
Start and stop are translation start and stop (first coding base,
and last base of the stop codon) given in forward strand co-ordinates.
Any comments about your code or files should
go here.
file contents here.
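The codon table format described in the template (base triplet, equals sign, integer, with commas between codons) can be produced with a small helper such as the one below. It is a sketch under stated assumptions: the function name format_codon_table is hypothetical, the counts are assumed to be keyed by DNA triplets read from the FASTA file, and the conversion of T to U and the four-codons-per-line layout simply imitate the example tables on this page rather than a stated requirement (the template says codons can be reported in any order).

```python
BASES = "UCAG"


def format_codon_table(counts):
    """Render codon counts as comma-separated TRIPLET=count entries."""
    # Convert DNA triplets to the RNA alphabet used in the example tables.
    rna = {}
    for codon, n in counts.items():
        key = codon.upper().replace("T", "U")
        rna[key] = rna.get(key, 0) + n
    lines = []
    for first in BASES:
        for third in BASES:
            # Each line holds the four codons that share a first and
            # third base, with the second base running over U, C, A, G.
            row = [first + second + third for second in BASES]
            lines.append(",".join(f"{c}={rna.get(c, 0)}" for c in row))
    # Every line except the last ends with a comma.
    return ",\n".join(lines)
```

Codons that never occur are printed with a count of 0, so the output always has 64 entries on 16 lines.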
Below is an example of a homework file with the fields filled in
with the correct answers for a test case (the program is not complete!)
>ref|NC_002516.1|:483-2027
UUU=3212,UCU=1553,UAU=9810,UGU=1866,
UUC=62910,UCC=22488,UAC=37382,UGC=16795,
UUA=529,UCA=1087,UAA=526,UGA=4402,
UUG=16303,UCG=24300,UAG=639,UGG=27634,
CUU=5767,CCU=3954,CAU=11695,CGU=14779,
CUC=51800,CCC=24327,CAC=28687,CGC=91770,
CUA=2612,CCA=4055,CAA=11649,CGA=4413,
CUG=154699,CCG=62113,CAG=67556,CGG=26244,
AUU=5332,ACU=3088,AAU=6967,AGU=4892,
AUC=70529,ACC=61133,AAC=42153,AGC=48367,
AUA=1763,ACA=1513,AAA=6688,AGA=889,
AUG=37688,ACG=11827,AAG=46656,AGG=3751,
GUU=5083,GCU=8973,GAU=19510,GGU=15426,
GUC=53743,GCC=126314,GAC=79468,GGC=115499,
GUA=7442,GCA=9047,GAA=43631,GGA=7754,
GUG=62357,GCG=72463,GAG=69613,GGG=18460
UUU=6641,UCU=4263,UAU=6547,UGU=3768,
UUC=45186,UCC=18441,UAC=20743,UGC=16352,
UUA=1497,UCA=4257,UAA=553,UGA=2738,
UUG=15288,UCG=20321,UAG=1103,UGG=19003,
CUU=27054,CCU=22153,CAU=27123,CGU=19974,
CUC=51066,CCC=24296,CAC=40364,CGC=85575,
CUA=7740,CCA=21638,CAA=19720,CGA=32483,
CUG=99897,CCG=50499,CAG=87200,CGG=54266,
AUU=6769,ACU=6743,AAU=6998,AGU=5841,
AUC=39437,ACC=36292,AAC=23632,AGC=31247,
AUA=5117,ACA=3946,AAA=6252,AGA=3524,
AUG=23311,ACG=15739,AAG=24921,AGG=9863,
GUU=23729,GCU=37004,GAU=43007,GGU=41022,
GUC=53919,GCC=107598,GAC=59820,GGC=111015,
GUA=18187,GCA=21170,GAA=46706,GGA=23496,
GUG=40673,GCG=79775,GAG=51312,GGG=25529
Run the bash file. It calls the Python
scripts, which write results to standard output.
python compute_codon_table.py -f NC_000912.ffn
python find_nonoverlapping_orfs.py -f NC_000912.fna -o1 orfs.txt -o2 orfs.ffn
python compute_codon_table.py -f orfs.ffn
"""
This reads in a FFN file (which has multiple fasta records)
and computes the codon tables
"""
if __name__ == '__main__':
    parse_command_line()
    compute_codon_table()
    report_results()
"""
This reads in a FASTA file
and finds the non-overlapping orfs
"""
if __name__ == '__main__':
    parse_command_line()
    find_nonoverlapping_orfs()
    report_results()