Genome 540 Homework Assignment 3
Due Sunday Jan. 30
- Write a program to find a highest-weight path in a weighted
directed acyclic graph. This program should
- read in an input file that contains a representation of the graph in the following form:
- a list of vertices, with each vertex on a separate line. The line should consist simply of a label (arbitrary string of up to 10 characters) for the vertex, which should be unique, i.e. different for different vertices. You may assume that parents precede children in the file, and then use this order in implementing the dynamic programming algorithm.
- a list of edges, with each edge on a separate line. The line should consist of
- a label for the edge (which need not be unique -- i.e. different edges can have the same
label);
- the label of the edge's beginning
vertex;
- the label of its ending vertex;
- the numerical weight
attached to the edge.
- use dynamic programming to
find the highest weight path (with additive, not multiplicative, path weights).
- output:
- the path score
- the label of the beginning vertex on the path
- the label of the ending vertex on the path
- (in order) the labels for all the edges that occur on this path. For example, if
the graph is a sequence graph and labels correspond to nucleotides in the sequence, the list of labels would correspond to the sequence of the highest scoring segment of the sequence.
- Write a program that
- reads in 2 input files:
-
a FASTA file containing a DNA sequence, and
- a 'scoring scheme' file that indicates a score to be attached to each possible base (A, C, G, T, or other)
-
creates the sequence graph for the sequence, as described in lecture (i.e. a 'linked list' with one edge per residue).
- the labels attached to successive vertices should be successive integers, represented as character strings (the first vertex being '0', the second being '1', the third being '2', and so on);
- the label attached to each edge should be the corresponding residue;
- the weight attached to an edge should be the score for that residue, as specified in the scoring scheme file.
- outputs this graph in a file that has the appropriate input format for the program described in #1 above.
- For a genome sequence:
- Run the program in #2 above, using as input files the .fna file for that genome
and a scoring scheme file that attaches -1.49 to an A or T nucleotide, and 0.74 to a G or C (and assigns 0 to any ambiguity characters in the sequence). Then run your program in #1 above on the resulting output file in order to find the highest scoring segment (= path) of the sequence. Use the second output option (i.e. printing path score, and the sequence itself).
Your final output should include
- the name and first line of the .fna file
- the location (the first and last nucleotide positions, as deduced from the vertex labels), the sequence (as indicated by the edge labels), and score of the highest-scoring
segment.
- For the homework assignment, you should run your programs on the genome sequence of Thermococcus kodakaraensis. To test whether your program is working correctly, run it first
on the test example Pyrococcus furiosus
to see whether you get the right answer (below).
- You must turn in your results and your computer
program, using the template (file format) described below.
Please put everything into ONE plain text file - do not send an archive
of files or a tar file, or a word processing document file. Compress it (using either Unix compress, or gzip -- if
you don't have access to either of these programs let us know), and
send it as an attachment to both Phil at phg@u.washington.edu, and
Tobias at mann@gs.washington.edu.
Here is the template:
first line of the .fna file that you use to find the
highest scoring segment
Nucleotide histograms should give, for each base or 'ambiguity code' occurring in the sequence,
the letter denoting the base, followed by an equals sign, followed by an
integer giving the number of times the base occurs in the sequence.
Put a comma between the different bases.
For instance, A=50,C=50,G=50,T=50,N=2
For this assignment, use the forward strand.
put the starting base of the highest scoring sequence here
followed by a dash and then the ending base in forward
strand coordinates. For example:
50-100
this should look like a nucleotide histogram except that
the score values may be any real numbers, e.g.
A=1.3,T=1.1,G=-2.7,C=-2.7
put the score of the highest scoring sequence segment here
the dna sequence of the highest scoring segment should go here.
You should identify the segment (based on checking the genbank annotations)
and put your identification here.
Any comments about your code or files should
go here.
file contents here.
Below is an example of a homework file with the fields filled in
with the correct answers for a test case.
>gi|18976372|ref|NC_003413.1| Pyrococcus furiosus DSM 3638, complete genome
A=565156, C=388629, T=565106, G=389365
9570-9890
A=-1.49,T=-1.49,G=0.74,C=0.74
50.22
GGCGGCGGGCTAGGCCGGGGGGTTCGGCGTCCCCTGTAACCGGAAACCGCCGATATGCCG
GGGCCGAAGCCCGGGGGGCGGTTCCCAAAGCCGCTCCCAGAAGCCGAGGTCGAACGATGA
GTCCTCGTCCCGCGGGGTGCCCGGTGGGGGAGGCACGGCTGAAGGGCCGTGCTAACCCCC
TTTGGGCCCCGAACCCCGCAAGGCCCGGAAGGGAGCAGCGGTAGGGGCCACGGAGCACGC
TCGCGGGGGTGCGGGGATGAGATAGGCCTCGGTGGATGGGAGCGGTGGAGGGTTCCCACC
CTCGGGCGTGCCCGCCGCCGC
This is a structural RNA gene and annotated as a signal
recognition particle. The annotated gene starts at 9570
and ends at 9892, including two bases past the end of the
maximally scoring segment. The actual sequence ends: CCGCTA.
The final 'TA' was not included in the maximal scoring segment
because the TA bases would decrease the score.
run the bash file. It calls the python
script, which writes results to standard out.
python hello_world.py
"""
This script writes 'hello world' to the standard out
"""
if __name__ == '__main__':
print "hello world"