Genome 540 Homework Assignment 1
Due Saturday Jan. 14
Policy on late homework: It will be accepted, but penalized.
- Download and begin reading "Initial sequencing and analysis of the human genome", The Genome
International Sequencing Consortium, Nature 409, 860-921 (15
February 2001). (To print this out, I would recommend the
pdf format, which corresponds exactly to the printed version, rather
than the html format.) For next week, read
- introduction and background (pp. 860-863, up to but not including "Strategic issues")
- the section "Broad genomic landscape" (pp. 875-879, up to but not including "Repeat content of the human genome")
- the section "Gene content of the human genome" (starting p. 892) up to but not including "comparative proteome analysis" (p. 901).
- Write a program that implements the method described in Lecture 2
to find the longest exactly matching subsequence between two sequences.
Specifically, your program should read two input files in "FASTA"
format (i.e. having a header line which starts with the character ">"
and includes the sequence name, with the sequence itself following on
subsequent lines), each of which contains one of the sequences to be compared. As a data check, your program should count
the number of bases of each type in each sequence. It should then use
the described algorithm to find the longest subsequence present in both
sequences. If there are several different perfectly repeated subsequences
of the same length, find all of them. When you are looking for these,
consider both strands of each sequence (i.e. both the sequence
given in the FASTA file, which is sometimes called the 'forward'
strand, and its reverse complement, which is sometimes called the
'reverse' strand) simultaneously, so that if the repeated subsequence
happens to occur on different strands in the two sequences you will still find
it. The easiest way to do this is to create and store in memory a
single sequence of length 2N + 2M (where the two input
sequences have lengths N and M bases respectively) constructed by
concatenating together the two sequences and their reverse
complements, and then create a single list of pointers to each
position in this merged sequence, which you then sort.
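As a rough illustration of the merged-sequence idea above, here is a minimal Python sketch. It is not necessarily the exact method from Lecture 2, and the names reverse_complement, longest_shared_substrings and clipped are hypothetical. It builds the sequence of length 2N + 2M, sorts a list of positions by the suffix starting at each position, and compares neighbours in the sorted list that come from different input sequences. Suffixes are cut off at the end of their own strand so that a match cannot run across a junction in the merged sequence. Sorting by sliced strings copies a lot of data, so this is only suitable for small inputs; ambiguity codes such as N are treated as ordinary characters, which may not be what you want.

```python
from bisect import bisect_right

# Complement table for A, C, G, T; any other symbol (e.g. N) is left unchanged.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_shared_substrings(seq_a, seq_b):
    """Return (length, sorted list of distinct longest substrings shared by
    seq_a and seq_b), considering both strands of each sequence."""
    n, m = len(seq_a), len(seq_b)
    # Merged sequence of length 2N + 2M: A, rc(A), B, rc(B).
    merged = seq_a + reverse_complement(seq_a) + seq_b + reverse_complement(seq_b)
    # End position (exclusive) of each of the four strands within merged.
    ends = [n, 2 * n, 2 * n + m, 2 * n + 2 * m]

    def clipped(p):
        # Suffix starting at p, cut off at the end of the strand containing p.
        return merged[p:ends[bisect_right(ends, p)]]

    # The "list of pointers": positions sorted by the suffix they point to.
    pointers = sorted(range(len(merged)), key=clipped)

    best_len, best = 0, set()
    for p, q in zip(pointers, pointers[1:]):
        # Only compare neighbours that come from different input sequences.
        if (p < 2 * n) == (q < 2 * n):
            continue
        s, t = clipped(p), clipped(q)
        k = 0
        while k < min(len(s), len(t)) and s[k] == t[k]:
            k += 1
        if k > best_len:
            best_len, best = k, {s[:k]}
        elif k == best_len and k > 0:
            best.add(s[:k])
    return best_len, sorted(best)
```

For example, longest_shared_substrings("TTACGGA", "CCGTAACC") returns (6, ['CCGTAA', 'TTACGG']): the match lies on opposite strands of the two inputs, and it is reported together with its reverse complement. For megabase inputs you would want to compare suffixes in place rather than slicing out copies.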
- Using this program, find the longest exactly
matching sequences between this one-megabase sequence from the human genome and this one-megabase sequence from the rat genome.
Once you have found the longest matches using your program,
try to figure out what biological feature they correspond to. Do this by exploring the UC Santa Cruz 'Genome Bioinformatics' web site.
- To test whether your program is working correctly, run it first
on the test example indicated below (with two different bacterial genomes)
to see whether you get the right answer. You can find the FASTA files for these and
other bacterial genomes by going to the NCBI web site and following
appropriate links. (On the NCBI site, the FASTA files containing the full genome sequences have the suffix
.fna.) (To find the biological features, look
in the 'Genbank format' files for the organisms, which
have the suffix .gbk on the NCBI site, and find the annotated
'feature' in each genome that overlaps the matching segment you
found. You don't need to write a program to do this; you can just
read the .gbk file on the web site.)
- You must turn in your results and your computer
program, using the template (file format) described below. Please put everything into ONE file - do not send an archive
of files or a tar file.
After creating a plain text file (NOT a word processing document file)
in this format, compress it (using either Unix compress, or gzip -- if
you don't have access to either of these programs let us know), and
send it as an attachment to both Phil at phg@u.washington.edu, and
Aaron at aklammer@u.washington.edu.
The indentation and line breaks below are for readability,
and can be omitted. For instance, either of the following
is okay:
some text
or,
some text
Here is the template:
first line of one FASTA file
first line of the other FASTA file
Nucleotide histograms should give, for each base or 'ambiguity code' occurring in the sequence,
the letter denoting the base, followed by an equals sign, followed by an
integer giving the number of times the base occurs in the sequence.
Put a comma between the different bases.
For instance, A=50,C=50,G=50,T=50,N=2
A nucleotide histogram for the other fasta file
Put the location of the matching subsequence within the input sequence here. The location
should be the index of the DNA base in the sequence
that is closest to the beginning of the forward strand.
Use a coordinate system starting at 1 rather than 0.
For example, if the two chromosomal strands are:
5'-ACTGA-3'
3'-TGACT-5'
and you found the subsequence TCA on the reverse strand to be the longest
match to the other sequence, then the location
should be reported as 3. If instead you found CTG on the forward
strand, then the location should be reported as 2.
Put the location of the matching subsequence in the other input sequence here.
longest shared DNA subsequence goes here
(give additional matches of same length, if any, in same format as above ... )
Put a short identification of the shared DNA subsequence
here, for instance:
"This DNA sequence is the first 20 bases of an RNA polymerase
gene"
Any comments about your code or files should
go here.
file contents here.
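The histogram and location fields described in the template can be produced with a few small helpers. The following Python sketch is only one possible way to do it; the function names read_fasta, nucleotide_histogram and forward_location are hypothetical. It assumes each FASTA file holds a single sequence, converts the sequence to upper case (an assumption, since some downloads use lower case for masked regions), and lists the bases in alphabetical order (the template does not prescribe an order).

```python
from collections import Counter


def read_fasta(path):
    """Return (header_line, sequence) for a FASTA file with one record."""
    with open(path) as handle:
        header = handle.readline().rstrip("\n")
        # Assumption: join all remaining lines and upper-case the result.
        sequence = "".join(line.strip() for line in handle).upper()
    return header, sequence


def nucleotide_histogram(sequence):
    """Format base counts as in the template, e.g. 'A=50,C=50,G=50,T=50'."""
    counts = Counter(sequence)
    return ",".join(f"{base}={counts[base]}" for base in sorted(counts))


def forward_location(start, length, seq_len, on_reverse):
    """Return the 1-based forward-strand index of the matching base that is
    closest to the beginning of the forward strand.

    start is the 0-based offset of the match within the strand on which it
    was found (the reverse strand being the reverse complement, read 5' to 3').
    """
    if on_reverse:
        # The last base of the match on the reverse strand pairs with the
        # forward-strand base nearest the start of the forward strand.
        return seq_len - (start + length) + 1
    return start + 1
```

With the five-base example from the template, forward_location(0, 3, 5, True) gives 3 for TCA on the reverse strand, and forward_location(1, 3, 5, False) gives 2 for CTG on the forward strand, in agreement with the values stated there.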
Here is an example of a homework file with the fields filled in
with the correct answers for a test case (the program is given only in abbreviated form).
>gi|13507739|ref|NC_000912.1| Mycoplasma pneumoniae M129, complete genome
>gi|31544204|ref|NC_004829.1| Mycoplasma gallisepticum R, complete genome
A=249211,C=162920,T=240560,G=163703
A=343648,C=156658,T=339399,G=156717
122006
324176
GTCGGGTAAATTCCGTCCCGCTTGAATGGTGTAACCATCTCTTGACTGTCTCGGCTATAG
ACTCGGTGAAATCCAGGTACGGGTGAAGACACCCGTTAGGCGCAACGGGACGGAAAGACC
CC
122006
324176
GGGGTCTTTCCGTCCCGTTGCGCCTAACGGGTGTCTTCACCCGTACCTGGATTTCACCGA
GTCTATAGCCGAGACAGTCAAGAGATGGTTACACCATTCAAGCGGGACGGAATTTACCCG
AC
This sequence comes from the 23S ribosomal RNA gene.
Run the bash file. It calls the Python
script, which writes results to standard out.
python longest_repeated_subsequence.py -a NC_004829.fna -b NC_000912.fna
"""
this file reads in two FASTA files, and
finds the longest exactly repeated sequence
shared by the two organisms
"""
if __name__ == '__main__':
    parse_command_line()
    find_longest_repeated_subsequence()
    report_results()