Genome 540 Homework Assignment 1

Due Sunday Jan. 16

Policy on late homework: It will be accepted, but penalized.

Read The evolutionary origin of complex features. R.E. Lenski, C. Ofria, R.T. Pennock, and C. Adami. Nature 423 (2003) 139-145
Download and begin reading Initial sequencing and analysis of the human genome. The Genome International Sequencing Consortium. Nature 409, 860-921 (15 February 2001). (To print this out, I recommend the PDF format, which corresponds exactly to the printed version, rather than the HTML format.) For next week, read
- introduction and background (pp. 860-863, up to but not including "Strategic issues")
- the section "Broad genomic landscape" (pp. 875-879, up to but not including "Repeat content of the human genome")
- the section "Gene content of the human genome" (starting p. 892) up to but not including "comparative proteome analysis" (p. 901).
Write a program that implements the method described in Lecture 2 to find the longest exactly matching sequence between two genomes. Specifically, your program should read two input files in "FASTA" format (i.e. having a header line which starts with the character ">" and includes the organism name, with the sequence itself following on subsequent lines), each of which contains the complete sequence for one of the genomes. As a data check, your program should count the number of bases of each type in each genome.
It should then use the described algorithm to find the longest subsequence (i.e. contiguous run of bases) present in both genomes. If there are several different exactly matching sequences of that same maximal length, find all of them. When you are looking for these, consider both strands of each genome simultaneously (i.e. both the sequence given in the FASTA file, which is sometimes called the 'forward' strand, and its reverse complement, which is sometimes called the 'reverse' strand), so that if the matching sequence happens to occur on different strands in the two genomes you will still find it.
The easiest way to do this is to create and store in memory a single sequence of length 2N + 2M (where the two input genome sequences have lengths N and M bases respectively), constructed by concatenating the two genome sequences and their reverse complements, and then create a single list of pointers to each position in this merged sequence, which you then sort.
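The merged-sequence idea can be sketched in Python as follows. This is only an illustration for short test strings, not the Lecture 2 method itself. It assumes the pointers are sorted lexicographically by the sequence starting at each position, and it assumes uppercase input. It also inserts the separator characters '#', '$' and '%' between the four pieces so that no match can run across a junction, which makes the merged string three characters longer than 2N + 2M. The names reverse_complement and longest_shared are made up for this example.

```python
# Complement table for the four standard bases; other letters (e.g. N)
# are left unchanged by str.translate.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_shared(genome_a, genome_b):
    """Return (length, set of distinct longest sequences shared by A and B).

    Both strands of each genome are considered. Separators are an
    assumption of this sketch; they stop matches at the junctions.
    """
    a_part = genome_a + "#" + reverse_complement(genome_a) + "$"
    b_part = genome_b + "%" + reverse_complement(genome_b)
    merged = a_part + b_part
    b_start = len(a_part)  # positions >= b_start belong to genome B

    # The "pointers" are plain integer positions, sorted by the text that
    # starts there. Slicing copies each suffix, so this is far too slow
    # and memory-hungry for real genomes.
    order = sorted(range(len(merged)), key=lambda p: merged[p:])

    best_len, best = 0, set()
    for p, q in zip(order, order[1:]):
        if (p < b_start) == (q < b_start):
            continue  # both pointers fall in the same genome
        # Count how many characters the two neighbours have in common.
        n = 0
        while (p + n < len(merged) and q + n < len(merged)
               and merged[p + n] == merged[q + n]):
            n += 1
        if n > best_len:
            best_len, best = n, {merged[p:p + n]}
        elif n == best_len and n > 0:
            best.add(merged[p:p + n])
    return best_len, best
```

Only neighbours in the sorted list are compared, because the longest match between the two genomes always shows up between some adjacent pair of pointers that come from different genomes. Each match is typically found twice, once per strand, which is consistent with the forward and reverse entries in the example answer file.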
Using this program, you should then find the longest exactly matching sequences in the genomes of Pseudomonas aeruginosa (strain PAO1) and Salmonella typhimurium (strain LT2), which are two very distantly related bacteria. You can find the FASTA files for these and other bacterial genomes by going to the NCBI web site and following appropriate links. (On the NCBI site, the FASTA files containing the full genome sequences have the suffix .fna). Once you find the longest matches using your program, you should then determine what biological feature they correspond to. Do this by looking in the 'Genbank format' files for the organisms (which have the suffix .gbk on the NCBI site) and finding the annotated 'feature' in each genome that overlaps the matching segment you found. You don't need to write a program to do this -- you can just read the .gbk file on the web site.
To test whether your program is working correctly, run it first on the test example indicated below (with two different Mycoplasma species) to see whether you get the right answer.
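A quick first check on the test example is the data check itself: if the header line and base counts your program produces differ from the expected ones, the file-reading step is at fault rather than the matching step. Below is a minimal Python sketch of that step. It assumes each file holds a single FASTA record, converts the sequence to uppercase (an assumption of this sketch), and lists the bases in alphabetical order, which the assignment does not require. The function names parse_fasta and nucleotide_histogram are hypothetical.

```python
from collections import Counter


def parse_fasta(lines):
    """Return (header, sequence) from the lines of a single-record FASTA file."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                raise ValueError("expected a single FASTA record")
            header = line
        else:
            chunks.append(line.upper())
    if header is None:
        raise ValueError("missing FASTA header line")
    return header, "".join(chunks)


def nucleotide_histogram(sequence):
    """Format base counts as 'A=..,C=..', one entry per letter present."""
    counts = Counter(sequence)
    return ",".join(f"{base}={counts[base]}" for base in sorted(counts))
```

With a real file this would be used as `with open(path) as handle: header, seq = parse_fasta(handle)`, where `path` is whatever file name you downloaded.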
You must turn in your results and your computer program, using the template (file format) described below. Please put everything into ONE file - do not send an archive of files or a tar file. After creating a plain text file (NOT a word processing document file) in this format, compress it (using either Unix compress, or gzip -- if you don't have access to either of these programs let us know), and send it as an attachment to both Phil at phg@u.washington.edu, and Tobias at mann@gs.washington.edu. The indentation and line breaks below are for readability, and can be omitted. For instance, either of the following is okay:

<result>
  some text
</result>

or,

<result> some text </result>

Here is the template:

<gs540_hw assignment='1' name='student name' email='student email'>
  <results>
    <result type='first line' file='filename'>
      first line of one FASTA file
    </result>
    <result type='first line' file='filename'>
      first line of the other FASTA file
    </result>
    <result type='nucleotide histogram' file='filename'>
      Nucleotide histograms should give, for each base or 'ambiguity code'
      occurring in the sequence, the letter denoting the base, followed by an
      equals sign, followed by an integer giving the number of times the base
      occurs in the sequence. Put a comma between the different bases.
      For instance,
      A=50,C=50,G=50,T=50,N=2
    </result>
    <result type='nucleotide histogram' file='filename'>
      A nucleotide histogram for the other fasta file
    </result>
    <result type='DNA sequence'>
      <location file='filename' strand='forward' or 'reverse'>
        Put the location of the matching sequence within the chromosome here.
        The location should be the index of the DNA base in the sequence that
        is closest to the beginning of the forward strand. Use a coordinate
        system starting at 1 rather than 0. For example, if the two
        chromosomal strands are:
          5'-ACTGA-3'
          3'-TGACT-5'
        and you found the sequence TCA on the reverse strand to be the longest
        match to the other genome, then the location should be reported as 3.
        If instead you found CTG on the forward strand, then the location
        should be reported as 2.
      </location>
      <location file='filename' strand='forward' or 'reverse'>
        Put the location of the matching sequence on the other chromosome here.
      </location>
      longest shared DNA sequence goes here
    </result>
    <result type='DNA sequence'>
      (give additional matches of same length, if any, in same format as above ...)
    </result>
  </results>
  <analysis>
    Put a short identification of the shared DNA sequence here, for instance:
    "This DNA sequence is the first 20 bases of an RNA polymerase gene"
  </analysis>
  <program>
    <comments>
      Any comments about your code or files should go here.
    </comments>
    <file name='filename'>
      file contents here.
    </file>
  </program>
</gs540_hw>

Here is an example of a homework file with the fields filled in with the correct answers for a test case (the program is given only in abbreviated form).

<gs540_hw assignment='1' name='Tobias Mann' email='mann@gs.washington.edu'>
  <results>
    <result type='first line' file='NC_000912.fna'>
      >gi|13507739|ref|NC_000912.1| Mycoplasma pneumoniae M129, complete genome
    </result>
    <result type='first line' file='NC_004829.fna'>
      >gi|31544204|ref|NC_004829.1| Mycoplasma gallisepticum R, complete genome
    </result>
    <result type='nucleotide histogram' file='NC_000912.fna'>
      A=249211,C=162920,T=240560,G=163703
    </result>
    <result type='nucleotide histogram' file='NC_004829.fna'>
      A=343648,C=156658,T=339399,G=156717
    </result>
    <result type='DNA sequence'>
      <location file='NC_000912.fna' strand='forward'>
        122006
      </location>
      <location file='NC_004829.fna' strand='forward'>
        324176
      </location>
      GTCGGGTAAATTCCGTCCCGCTTGAATGGTGTAACCATCTCTTGACTGTCTCGGCTATAG
      ACTCGGTGAAATCCAGGTACGGGTGAAGACACCCGTTAGGCGCAACGGGACGGAAAGACC
      CC
    </result>
    <result type='DNA sequence'>
      <location file='NC_000912.fna' strand='reverse'>
        122006
      </location>
      <location file='NC_004829.fna' strand='reverse'>
        324176
      </location>
      GGGGTCTTTCCGTCCCGTTGCGCCTAACGGGTGTCTTCACCCGTACCTGGATTTCACCGA
      GTCTATAGCCGAGACAGTCAAGAGATGGTTACACCATTCAAGCGGGACGGAATTTACCCG
      AC
    </result>
  </results>
  <analysis>
    This sequence comes from the 23S ribosomal RNA gene.
  </analysis>
  <program>
    <comments>
      run the bash file. It calls the python script, which writes results to standard out.
    </comments>
    <file name='hw1.bash'>
python longest_repeated_subsequence.py -a NC_004829.fna -b NC_000912.fna
    </file>
    <file name='longest_repeated_subsequence.py'>
"""
this file reads in two FASTA files, and finds the longest
exactly repeated sequence shared by the two organisms
"""

if __name__ == '__main__':
    parse_command_line()
    find_longest_repeated_subsequence()
    report_results()
    </file>
  </program>
</gs540_hw>
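
The location convention described in the template (1-based, measured from the beginning of the forward strand regardless of which strand the match is on) can be checked with a small helper. This is a sketch which assumes your program knows the 0-based offset of the match within the strand it was found on. For the reverse strand, that means the offset within the reverse complement read 5' to 3'. The function name and its arguments are hypothetical.

```python
def reported_location(offset, match_length, genome_length, strand):
    """Convert a 0-based offset on a strand to the 1-based reported location.

    offset        -- 0-based start of the match on the strand it was found on
    match_length  -- number of bases in the match
    genome_length -- number of bases in the genome (N or M)
    strand        -- 'forward' or 'reverse'
    """
    if strand == "forward":
        return offset + 1
    if strand == "reverse":
        # The last base of the match on the reverse strand pairs with the
        # forward-strand base closest to the start of the forward strand.
        return genome_length - offset - match_length + 1
    raise ValueError("strand must be 'forward' or 'reverse'")
```

For the five-base example in the template, the reverse strand read 5' to 3' is TCAGT, so TCA sits at offset 0 and is reported at location 3, while CTG sits at offset 1 on the forward strand and is reported at location 2.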