Genome 540 Homework Assignment 1

Due Friday Jan. 16

Policy on late homework: It will be accepted, but penalized.

Read The evolutionary origin of complex features. R.E. Lenski, C. Ofria, R.T. Pennock, and C. Adami. Nature 423 (2003) 139-145
Download and begin reading Initial sequencing and analysis of the human genome. International Human Genome Sequencing Consortium. Nature 409, 860-921 (15 February 2001). (To print this out, I recommend printing the PDF version, which corresponds exactly to the printed article, rather than the HTML version.) For next week, read:
- introduction and background (pp. 860-863, up to but not including "Strategic issues")
- the section "Broad genomic landscape" (pp. 875-879, up to but not including "Repeat content of the human genome")
- the section "Gene content of the human genome" (starting p. 892) up to but not including "comparative proteome analysis" (p. 901).
Spend an hour or two exploring the NCBI web site, following as many links as possible, reading as much material as you can, and getting an idea of the overall structure of the site.
Find a bacterium or archaeon whose complete genome sequence is available on that site and is at least 500,000 bases long, and for which one of the organism's initials (i.e. the first letter of its first name, or the first letter of its last name) is the same as one of your initials (if no organism has initials meeting this condition, choose one at random). For this organism, find a file in "FASTA" format (i.e. having a header line which starts with the character ">" and includes the organism name, with the sequence itself following on subsequent lines) containing the complete genome sequence; this file will have a name with the extension ".fna". Download this file.
Write a program which reads in the file you downloaded in the previous step, counts the nucleotides of each type in the sequence (i.e. the number of A's, the number of C's, etc., including ambiguity codes such as N, R, and Y, if any), and prints out:
1. the name of the file (this should be the same as the file name on the NCBI web site -- i.e. don't rename it!)
2. the header line of the file
3. a table indicating the nucleotide counts, and the total number of nucleotides.
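As a rough illustration of the three outputs listed above, here is a minimal Python sketch. It is one possible approach, not a required one. The function names (count_nucleotides, format_report, report) are made up for this example. It assumes the file holds a single genome record: only the first header line is kept, and every non-header line is counted as sequence. It also assumes that lowercase letters should be counted together with their uppercase forms.

```python
import os
from collections import Counter


def count_nucleotides(lines):
    """Return (header, counts) for FASTA text given as an iterable of lines.

    Assumption: a single-record file. Only the first ">" line is kept as
    the header; all other non-blank lines are treated as sequence.
    """
    header = None
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is None:
                header = line
            continue
        # Assumption: fold lowercase letters into uppercase before counting.
        counts.update(line.upper())
    return header, counts


def format_report(path, header, counts):
    """Build a compact plain-text report: file name, header, count table."""
    rows = [os.path.basename(path), header or ""]
    for base in sorted(counts):
        rows.append(f"{base}\t{counts[base]}")
    rows.append(f"Total\t{sum(counts.values())}")
    return "\n".join(rows)


def report(path):
    """Read the FASTA file at `path` line by line and return the report."""
    with open(path) as handle:
        header, counts = count_nucleotides(handle)
    return format_report(path, header, counts)
```

Calling print(report(path)) with the path of the downloaded ".fna" file prints the file name, the header line, and one row per symbol found, followed by the total. Reading line by line keeps memory use small even for genomes of several million bases.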
Email the output from running your program to me at phg@u.washington.edu AND to Chris Saunders. Please make the output as compact as possible. Do NOT send the code itself. Include the output in the body of your email message (as plain text), NOT as an attachment.
(Not to hand in -- this is a test of whether your basic programming skills will be adequate for future assignments): Write a program that generates 5 million random numbers between 0 and 1 and sorts them by increasing size. It should take you no more than 1/2 hour to write this program, and it should run in a few minutes or less.
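For a sense of scale, the self-test above can be sketched in a few lines of Python. This is only one way to do it. The function name sorted_random_numbers and its seed parameter are hypothetical, introduced here for illustration. Note that Python's random() draws from the half-open interval [0, 1), so 1.0 itself is never produced.

```python
import random


def sorted_random_numbers(n, seed=None):
    """Generate n uniform random floats in [0, 1) and return them sorted.

    `seed` is optional and only there to make a run reproducible.
    """
    rng = random.Random(seed)
    values = [rng.random() for _ in range(n)]
    values.sort()  # in-place sort, increasing order
    return values
```

For the exercise itself, the call would be sorted_random_numbers(5_000_000). Relying on the language's built-in sort rather than writing your own is the main design choice here; a hand-written quadratic sort would not finish in a reasonable time on 5 million values.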