Genome 540 Homework Assignment 1

Due Sunday Jan. 19, 11:59pm

Late homework policy is described on course web page.

  1. Download and begin reading "Initial sequencing and analysis of the human genome," International Human Genome Sequencing Consortium, Nature 409, 860-921 (15 February 2001). (To print it, I recommend the PDF format, which corresponds exactly to the printed version, rather than the HTML format.) For next week, read:
  2. Write a program that implements the method described in Lecture 2 to find the longest exactly matching subsequence between two sequences. Specifically, your program should:
  3. Using this program, find the longest exactly matching sequences between the Escherichia coli genome and the Lactobacillus plantarum genome. Once you have found the longest match(es), try to determine what biological feature they correspond to. You can do this by exploring the feature tables for both organisms here and here.
  4. To test whether your program is working correctly, run it first on the test example indicated below (with two different bacterial genomes) and check that you get the right answer. The FASTA files for the test examples can be found here and here. The General Feature Format (GFF) files are here and here. In general, FASTA files for these and other bacterial genomes can be found by going to the NCBI web site and following the appropriate links. On the NCBI site, the FASTA files containing the full genome sequences have the suffix .fna. To find the biological features, look in the 'General feature format' or 'Genbank format' files for the organisms (which have the suffixes .gff and .gbk, respectively, on the NCBI site) and find the annotated 'feature' in each genome that overlaps the matching segment you found. You don't need to write a program to do this -- you can just read the .gff or .gbk file on the website. .gff files are also searchable with 'bedtools intersect -a {your.gff} -b {region_of_interest.bed}'.
  5. You must turn in your results and your computer program, using the template (file format) described below. Please put everything into ONE file - do not send an archive of files or a tar file. After creating a plain text file (NOT a word processing document file) in this format, compress it (using either Unix compress, or gzip -- if you don't have access to either of these programs, let us know), and send it as an attachment to both Phil at phg@uw.edu and Mitchell at mvollger@uw.edu.
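
As a point of reference for item 2, here is a minimal sketch of what "longest exactly matching subsequence" means, written as a simple dynamic-programming longest-common-substring search. This is NOT the Lecture 2 method and it is far too slow for whole genomes; it is only meant for checking a real implementation on tiny inputs. The function name is hypothetical, and the sketch compares the two strings exactly as given (it does not consider the reverse-complement strand).

```python
def longest_common_substrings(s, t):
    """Return (length, set of substrings) for the longest exact matches.

    Illustrative brute-force check only: O(len(s) * len(t)) time,
    so it is suitable for short test strings, not bacterial genomes.
    """
    best_len = 0
    best = set()
    # prev[j] = length of the common suffix of s[:i-1] and t[:j]
    prev = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        curr = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    # Found a strictly longer match: start a new result set.
                    best_len = curr[j]
                    best = {s[i - best_len:i]}
                elif curr[j] == best_len:
                    # Tie with the current best length: keep it as well.
                    best.add(s[i - best_len:i])
        prev = curr
    return best_len, best


if __name__ == "__main__":
    length, matches = longest_common_substrings("ACGTTGCA", "TTGCC")
    print(length, sorted(matches))
```

A small check like this can be run against hand-made strings before moving on to the test genomes in item 4.
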
The text between the "=============" lines below is an example of a homework file with the fields filled in with the correct answers for the test case (see the genomes provided in #4 above). Several items are explained in detail in the notes below the template:

=============
Assignment: GS 540 HW1
Name: Mitchell Vollger
Email: mvollger@uw.edu
Language: C++
Runtime: 3.5 sec

>gi|284930242|gb|CP001872.1| Mycoplasma gallisepticum str. R(high), complete genome
*=1012027
A=349322
C=159094
G=159365
T=344246
N=0

>gi|440453185|gb|CP003913.1| Mycoplasma pneumoniae M129-B7, complete genome
*=816373
A=249201
C=162924
G=163697
T=240551
N=0

Match length: 122
Number of match strings: 2

Match string: GGGGTCTTTCCGTCCCGTTGCGCCTAACGGGTGTCTTCACCCGTACCTGGATTTCACCGAGTCTATAGCCGAGACAGTCAAGAGATGGTTACACCATTCAAGCGGGACGGAATTTACCCGAC
Description: This sequence comes from [look up entry in .gbff/.gff annotation file using the position information below].

Fasta: gi|284930242|gb|CP001872.1|
Position: 82469
Strand: reverse

Fasta: gi|440453185|gb|CP003913.1|
Position: 122005
Strand: reverse

Fasta: gi|284930242|gb|CP001872.1|
Position: 338240
Strand: reverse

Match string: GTCGGGTAAATTCCGTCCCGCTTGAATGGTGTAACCATCTCTTGACTGTCTCGGCTATAGACTCGGTGAAATCCAGGTACGGGTGAAGACACCCGTTAGGCGCAACGGGACGGAAAGACCCC
Description: This sequence comes from [look up entry in .gbff/.gff annotation file using the position information below].

Fasta: gi|284930242|gb|CP001872.1|
Position: 338240
Strand: forward

Fasta: gi|440453185|gb|CP003913.1|
Position: 122005
Strand: forward

Fasta: gi|284930242|gb|CP001872.1|
Position: 82469
Strand: forward



Program:

int main() {
	do_analysis();
	return 0;
}

=============

Details:

Fasta: put the name of the fasta file, along with the first line.

Nucleotide histogram: count the total number of bases ("*") and the number of times each specific base occurs. For example:

*=1012800
A=349623
C=159237
G=159490
T=344450
N=0
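
The histogram above could be produced along the following lines. This is a sketch under stated assumptions: the sequence has already been read from the FASTA file with the header line and newlines removed, lowercase bases are counted together with uppercase, and any character other than A, C, G or T is counted under N. Those choices, and the function names, are illustrative and not specified by the assignment.

```python
from collections import Counter


def nucleotide_histogram(sequence):
    """Count bases in a sequence string (header and newlines already removed).

    Assumption: anything that is not A, C, G or T is tallied as N.
    """
    counts = Counter(sequence.upper())
    total = sum(counts.values())
    hist = {base: counts.get(base, 0) for base in "ACGT"}
    hist["N"] = total - sum(hist.values())
    hist["*"] = total
    return hist


def format_histogram(hist):
    """Render the counts in the order used by the template: *, A, C, G, T, N."""
    return "\n".join(f"{key}={hist[key]}" for key in "*ACGTN")


if __name__ == "__main__":
    print(format_histogram(nucleotide_histogram("ACGTNacgtA")))
```
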

Description: Put a short identification of the shared DNA subsequence, for instance: "This DNA sequence is the first 20 bases of an RNA polymerase gene."

Position: Put the location of the matching subsequence within the input sequence here. The location should be the index of the DNA base in the sequence that is closest to the beginning of the forward strand. Use a coordinate system starting at 1 rather than 0. For example, if the two chromosomal strands are:

			 5'-ACTGA-3'
			 3'-TGACT-5'
and you found the subsequence TCA on the reverse strand to be the longest match to the other sequence, then the location should be reported as 3, because TCA pairs with TGA at forward-strand positions 3-5. If instead you found CTG on the forward strand, which occupies positions 2-4, then the location should be reported as 2.
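
The coordinate convention above can be sketched as follows. This is an illustrative helper, not part of the required program: the function names are hypothetical, the strand labels "forward" and "reverse" follow the template, and the sketch simply reports the first occurrence it finds.

```python
# Translation table for complementing bases; N is left unchanged.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def report_position(forward_seq, match, strand):
    """Return the 1-based forward-strand coordinate of a match.

    For a reverse-strand match, the match is reverse-complemented first,
    so the reported coordinate is the base closest to the beginning of
    the forward strand.
    """
    if strand == "forward":
        target = match
    elif strand == "reverse":
        target = reverse_complement(match)
    else:
        raise ValueError("strand must be 'forward' or 'reverse'")
    index = forward_seq.find(target)  # 0-based, -1 if absent
    if index < 0:
        raise ValueError("match not found in sequence")
    return index + 1  # convert to 1-based coordinates


if __name__ == "__main__":
    print(report_position("ACTGA", "TCA", "reverse"))
    print(report_position("ACTGA", "CTG", "forward"))
```

With the example strands above, the two calls reproduce the locations 3 and 2 given in the text.
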