Genome 540 Homework Assignment 1
Due Sunday Jan. 17
The late homework policy is described on the course web page.
- Download and begin reading "Initial sequencing and analysis of the human genome", The Genome
International Sequencing Consortium, Nature 409, 860-921 (15
February 2001). (To print it, I recommend the
pdf format, which corresponds exactly to the printed version, rather
than the html format.) For next week, read:
  - the introduction and background (pp. 860-863, up to but not including "Strategic issues")
  - the section "Broad genomic landscape" (pp. 875-879, up to but not including "Repeat content of the human genome")
  - the section "Gene content of the human genome" (starting on p. 892), up to but not including "Comparative proteome analysis" (p. 901)
- Write a program that implements the method described in Lecture 1
to find the longest exactly matching subsequence between two
sequences. Specifically, your program should read two input files in
"FASTA" format (i.e. having a header line which starts with the
character ">" and includes the sequence name, with the sequence
itself following on subsequent lines), each of which contains one of
the sequences to be compared. As a data check, your program should
count the number of bases of each type in each sequence. It should
then use the described algorithm to find the longest subsequence
present in both sequences. If there are several different perfectly
repeated subsequences of the same length, find all of them. When you
are looking for these, consider both strands of each sequence
(i.e. both the sequence given in the FASTA file, which is sometimes
called the 'forward' strand, and its reverse complement, which is
sometimes called the 'reverse' strand) simultaneously, so that if the
repeated subsequence happens to occur on different strands in the two
sequences you will still find it. The easiest way to do this is to
create and store in memory a single sequence of length 2N + 2M + 4
(where the two input sequences have lengths N and M bases
respectively) by concatenating the two sequences and their reverse
complements, each followed by a null character so that the
lexicographic sort does not read across sequence boundaries. Then
create a single list of pointers, one to each position in this merged
sequence, and sort that list. If a longest subsequence is present
multiple times in either genome, please report all of its locations.
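As a rough illustration of the merged-sequence approach described above, the following Python sketch builds the string of length 2N + 2M + 4, sorts the positions by the suffix starting at each one, and compares neighbouring suffixes that come from different input sequences. All function names are hypothetical. The slice-based sort is an assumption made for brevity: it copies far too much data to be usable on whole genomes, so treat this as a toy for short test strings rather than the Lecture 1 implementation.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(COMPLEMENT)[::-1]


def common_prefix_len(text, i, j):
    """Length of the shared prefix of text[i:] and text[j:], stopping at a null."""
    n = 0
    while text[i + n] != "\0" and text[i + n] == text[j + n]:
        n += 1
    return n


def longest_common_substrings(seq1, seq2):
    """Return (length, set of strings) for the longest exact matches
    between seq1 and seq2, considering both strands of each."""
    strands = [seq1, reverse_complement(seq1), seq2, reverse_complement(seq2)]
    # Each strand is followed by a null terminator: total length 2N + 2M + 4.
    merged = "".join(s + "\0" for s in strands)
    boundary = 2 * len(seq1) + 2  # positions below this belong to seq1

    # Toy suffix sort (assumption): fine for short strings only.
    order = sorted(range(len(merged)), key=lambda i: merged[i:])

    best_len, best = 0, set()
    for a, b in zip(order, order[1:]):
        if (a < boundary) == (b < boundary):
            continue  # both suffixes come from the same input sequence
        n = common_prefix_len(merged, a, b)
        if n > best_len:
            best_len, best = n, {merged[a:a + n]}
        elif n == best_len and n > 0:
            best.add(merged[a:a + n])
    return best_len, best


length, matches = longest_common_substrings("AAACCCGT", "TTGGGTTT")
print(length, sorted(matches))
```

Because both strands are included, each match is found together with its reverse complement, so the returned set holds both strings unless a match is its own reverse complement. Recovering every location of a match would need an additional scan over the sorted neighbours that share the full match as a prefix, which this sketch leaves out.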
- Using this program, you should then find the longest exactly
matching sequences in the Bacillus subtilis genome and the
Pseudomonas aeruginosa genome. Once you find the longest match(es)
using your program, you should then try to figure out what biological
feature they correspond to. You can do this by exploring the feature
tables for both organisms here and here.
- To test whether your program is working correctly, run it first
on the test example indicated below (with two different bacterial
genomes) to see whether you get the right answer. The FASTA files for
the test examples can be found here and here. The GenBank files are
here and here.
(All of these files from the NCBI FTP site need to be unzipped, for
example with gunzip.) In general, FASTA files for these and other
bacterial genomes can be found by going to the NCBI website or FTP
site and following the appropriate links. FASTA files have a .fna
extension and GenBank files have a .gb or .gbff extension. (To find
the biological features, look in the GenBank-format files for the
organisms and find the annotated feature in each genome that
overlaps the matching segment you found. You don't need to write a
program to do this; you can just read the .gb file on the web
site.)
- You must turn in your results and your computer
program, using the template (file format) described below. Please put everything into ONE file; do not send an archive
of files or a tar file.
After creating a plain text file (NOT a word processing document file)
in this format, compress it (using either Unix compress or gzip; if
you don't have access to either of these programs, let us know) and
send it as an attachment to both Phil at phg@u.washington.edu and
Anne at aclark4@uw.edu.
Here is an example of a homework file, with the fields filled in with
the correct answers for the test case, between the "=============" lines
(several items are explained in detail below the template):
=============
Assignment: GS 540 HW1
Name: Anne Clark
Email: aclark4@uw.edu
Fasta 1: GCA_000027345.1_ASM2734v1_genomic.fna
>U00089.2 Mycoplasma pneumoniae M129, complete genome
*=816394
A=249211
C=162920
G=163703
T=240560
Fasta 2: GCA_000025365.1_ASM2536v1_genomic.fna
>CP001872.1 Mycoplasma gallisepticum str. R(high), complete genome
*=1012027
A=349322
C=159094
G=159365
T=344246
Match length: 122
Number of match strings: 1
Match_string: GTCGGGTAAATTCCGTCCCGCTTGAATGGTGTAACCATCTCTTGACTGTCTCGGCTATAGACTCGGTGAAATCCAGGTACGGGTGAAGACACCCGTTAGGCGCAACGGGACGGAAAGACCCC
Description:
Fasta: GCA_000027345.1_ASM2734v1_genomic.fna
Position: 122006
Strand: forward
Fasta: GCA_000025365.1_ASM2536v1_genomic.fna
Position: 82469
Strand: forward
Fasta: GCA_000025365.1_ASM2536v1_genomic.fna
Position: 338240
Strand: forward
Program:
int main() {
do_analysis();
return 0;
}
=============
Details:
Fasta: put the name of the fasta file, along with the first line.
Nucleotide histogram: count the total number of bases ("*") and the number of times each specific base occurs, e.g.:
A=349623
C=159237
G=159490
T=344450
N=0
*=1012800
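The histogram step might be sketched in Python as follows. The names `parse_fasta` and `base_histogram` are hypothetical, and the sketch rests on a few assumptions that the assignment does not state: only the first record in the file is read, lowercase bases are folded to uppercase, and "*" is taken to be the total number of sequence characters, including any that are not A, C, G, T or N.

```python
from collections import Counter


def parse_fasta(lines):
    """Return (header, sequence) for the first record in an iterable of
    FASTA lines, such as an open file handle."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break  # assumption: only the first record is wanted
            header = line
        else:
            chunks.append(line.upper())  # assumption: fold case to uppercase
    return header, "".join(chunks)


def base_histogram(sequence):
    """Count each base type, plus the total under the key '*'."""
    counts = Counter(sequence)
    hist = {base: counts[base] for base in "ACGTN"}
    hist["*"] = len(sequence)  # assumption: total counts every character
    return hist


# Toy input standing in for the lines of a .fna file.
header, seq = parse_fasta([">demo toy record", "ACGTN", "aacc"])
print(header)
for key, value in base_histogram(seq).items():
    print(f"{key}={value}")
```

For a real genome, passing an open file handle, for example `parse_fasta(open(path))` with `path` pointing at the unzipped .fna file, should work the same way as the list of lines used here.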
Description: Put a short identification of the shared DNA subsequence, for instance:
"This DNA sequence is the first 20 bases of an RNA polymerase gene"
Position: Put the location of the matching subsequence
within the input sequence here. The location should be the index, in
forward-strand coordinates, of the base of the match that is closest
to the beginning of the forward strand. Use a coordinate system
starting at 1 rather than 0.
For example, if the two chromosomal strands are:
5'-ACTGA-3'
3'-TGACT-5'
and you found the subsequence TCA on the reverse strand to be the
longest match to the other sequence, then the location should be
reported as 3. If instead you found CTG on the forward strand, then
the location should be reported as 2.
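The coordinate rule above can be checked with a small Python sketch. The function name and its arguments are hypothetical. It assumes the match was found at a 0-based offset `start` within the strand that was searched, where the reverse strand is the reverse complement read from its own 5' end.

```python
def reported_position(start, match_len, seq_len, strand):
    """Convert a 0-based match offset on the searched strand into the
    1-based forward-strand index of the match base nearest the start
    of the forward strand."""
    if strand == "forward":
        return start + 1
    if strand == "reverse":
        # The last base of the match on the reverse strand pairs with
        # the match base closest to the start of the forward strand.
        return seq_len - start - match_len + 1
    raise ValueError("strand must be 'forward' or 'reverse'")


# The example from the text: forward strand ACTGA, reverse strand TCAGT.
print(reported_position(0, 3, 5, "reverse"))  # TCA on the reverse strand
print(reported_position(1, 3, 5, "forward"))  # CTG on the forward strand
```

With the five-base example, TCA starts at offset 0 of the reverse strand and maps to position 3, while CTG starts at offset 1 of the forward strand and maps to position 2, matching the values given above.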