Genome 540 Homework Assignment 2

Due Saturday Jan. 21

  1. Write a program to find a highest-weight path in a weighted directed acyclic graph. This program should
  2. Write a program that
  3. For a graphic depiction of a WDAG:
  4. For a genome sequence: Your final output should include
  5. For the homework assignment, you should run your programs on the graph depicted here, constraining the path to start at the node 'x' and end at the node 'i'; AND on the genome sequence of Thermococcus kodakaraensis. To test whether your program is working correctly, run it first on the test example graph, constraining the path to start at 'vii' and end at 'i'; and on the test genome sequence Pyrococcus furiosus to see whether you get the right answer (below). In the depicted graphs, the shape of the triangle indicates the direction in which the edge is pointing.
  6. You must turn in your results and your computer program, using the template (file format) described below. Please put everything into ONE plain text file; do not send an archive of files, a tar file, or a word processing document. Compress the file using either Unix compress or gzip (if you don't have access to either of these programs, let us know), and send it as an attachment to both Phil (phg (at) u.washington.edu) and Aaron (aklammer (at) u.washington.edu).
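As a rough illustration of item 1 and of the start/end constraint in item 5, the Python sketch below finds a highest-weight path between two given vertices by relaxing edges in topological order. It is one possible approach, not a required design. The function name `highest_weight_path`, the `(label, tail, head, weight)` edge tuples, and the four-vertex toy graph are all invented for illustration; the toy graph is neither the assignment graph nor the test graph.

```python
from collections import defaultdict, deque


def highest_weight_path(edges, start, end):
    """Return (score, [edge labels]) for a highest-weight path from start to end.

    edges is a list of (label, tail, head, weight) tuples (a hypothetical
    input format). Returns None if end cannot be reached from start.
    """
    out_edges = defaultdict(list)
    indegree = defaultdict(int)
    vertices = set()
    for label, tail, head, weight in edges:
        out_edges[tail].append((label, head, weight))
        indegree[head] += 1
        vertices.update((tail, head))

    # Topological order via Kahn's algorithm.
    queue = deque(v for v in vertices if indegree[v] == 0)
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)
        for _, head, _ in out_edges[v]:
            indegree[head] -= 1
            if indegree[head] == 0:
                queue.append(head)
    if len(order) != len(vertices):
        raise ValueError("graph contains a cycle, so it is not a DAG")

    # best[v] is the highest weight of any path from start to v.
    # Only vertices reachable from start ever get an entry.
    best = {start: 0.0}
    back = {}  # back[v] = (previous vertex, label of the edge used)
    for v in order:
        if v not in best:
            continue
        for label, head, weight in out_edges[v]:
            candidate = best[v] + weight
            if head not in best or candidate > best[head]:
                best[head] = candidate
                back[head] = (v, label)

    if end not in best:
        return None

    # Trace back from end to start to recover the edge labels.
    labels = []
    v = end
    while v != start:
        v, label = back[v]
        labels.append(label)
    labels.reverse()
    return best[end], labels


# Toy graph, invented for this sketch.
toy_edges = [
    ("P", "s", "a", 2),
    ("Q", "s", "b", 5),
    ("R", "a", "t", 4),
    ("S", "b", "t", -1),
    ("T", "a", "b", -3),
]
toy_result = highest_weight_path(toy_edges, "s", "t")
print(toy_result)
```

Processing vertices in topological order means every edge is examined once, after the best score of its tail is final, so the running time is linear in the number of vertices plus edges.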
Here is the template:

<gs540_hw assignment='2' name='student name' email='student email'>
  <results>
    <result type='part1' file='filename'>
      This is the result for the first part of the homework assignment (described in item 1 above).
      <firstline>
        The first line of the original file used to create this graph (in the case of a DNA sequence, the first line of the fasta file; in the case of a hand-created file from a graphical pdf, leave this blank).
      </firstline>
      <edgeweights>
        Weights associated with each edge label (not each edge!) in the graph. Put a comma between different edge labels. For instance: A=1.0, T=1.1, C=-2.3, G=-3.14
      </edgeweights>
      <edgehistogram>
        Histogram showing the number of occurrences of each edge label in the graph. Looks like edgeweights, except numbers must be non-negative integers.
      </edgehistogram>
      <score>
        Put the score of the highest scoring path in the graph here.
      </score>
      <beginningvertex>
        Starting vertex label.
      </beginningvertex>
      <endingvertex>
        Ending vertex label.
      </endingvertex>
      <path>
        The edges in the highest scoring path.
      </path>
    </result>
    <result type='part2' file='NC_003413.fna'>
      The same format as part1 above, except using output from the second part of the homework assignment (described in item 3 above).
    </result>
  </results>
  <analysis>
    You should identify the DNA segment here (based on genbank annotations).
  </analysis>
  <program>
    <comments>
      Any comments about your code or files should go here.
    </comments>
    <file name='filename1'>
      First file contents here.
    </file>
    <file name='filename2'>
      Second file contents here.
    </file>
  </program>
</gs540_hw>

Below is an example of a homework file with the fields filled in with the correct answers for the test cases.

<gs540_hw assignment='2' name='Aaron Klammer' email='aklammer@u.washington.edu'>
  <results>
    <result type='part1' file='hw2_sample.graph'>
      <firstline>
      </firstline>
      <edgeweights>
        A=-1, B=5, C=-2, D=4, E=-3, F=2, G=-3, H=-5, I=4, J=-3, K=1, L=-3
      </edgeweights>
      <edgehistogram>
        A=1, B=1, C=1, D=1, E=1, F=1, G=1, H=1, I=1, J=1, K=1, L=1
      </edgehistogram>
      <score>
        4
      </score>
      <beginningvertex>
        vii
      </beginningvertex>
      <endingvertex>
        i
      </endingvertex>
      <path>
        LIDA
      </path>
    </result>
    <result type='part2' file='NC_003413.fna'>
      <firstline>
        >gi|18976372|ref|NC_003413.1| Pyrococcus furiosus DSM 3638, complete genome
      </firstline>
      <edgeweights>
        A=-1.49,T=-1.49,G=0.74,C=0.74
      </edgeweights>
      <edgehistogram>
        A=565156, C=388629, T=565106, G=389365
      </edgehistogram>
      <score>
        50.22
      </score>
      <beginningvertex>
        9569
      </beginningvertex>
      <endingvertex>
        9890
      </endingvertex>
      <path>
GGCGGCGGGCTAGGCCGGGGGGTTCGGCGTCCCCTGTAACCGGAAACCGCCGATATGCCG
GGGCCGAAGCCCGGGGGGCGGTTCCCAAAGCCGCTCCCAGAAGCCGAGGTCGAACGATGA
GTCCTCGTCCCGCGGGGTGCCCGGTGGGGGAGGCACGGCTGAAGGGCCGTGCTAACCCCC
TTTGGGCCCCGAACCCCGCAAGGCCCGGAAGGGAGCAGCGGTAGGGGCCACGGAGCACGC
TCGCGGGGGTGCGGGGATGAGATAGGCCTCGGTGGATGGGAGCGGTGGAGGGTTCCCACC
CTCGGGCGTGCCCGCCGCCGC
      </path>
    </result>
  </results>
  <analysis>
    This is a structural RNA gene.
  </analysis>
  <program>
    <comments>
      Run the two bash files, the first for part 1, the second for part 2. They call the python scripts, which write results to standard out.
    </comments>
    <file name='hw2_part1.bash'>
python hello_world_part1.py
    </file>
    <file name='hello_world_part1.py'>
""" This script writes 'hello world' to the standard out """
if __name__ == '__main__':
    print("hello world")
    </file>
    <file name='hw2_part2.bash'>
python hello_world_part2.py
    </file>
    <file name='hello_world_part2.py'>
""" This script writes 'hello world' to the standard out """
if __name__ == '__main__':
    print("hello world")
    </file>
  </program>
</gs540_hw>
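
For the genome case, the part2 fields in the example suggest treating the sequence as a linear graph with one edge per base. The Python sketch below shows one way to compute the edge histogram and a highest-scoring segment under that view. It makes several assumptions that the assignment text does not state: the path may start and end anywhere, vertices are numbered 0 through n with base k (counting from 0) as the edge from vertex k to vertex k+1, and the function names `edge_histogram` and `best_segment` and the six-base toy sequence are invented for illustration. The weights are the ones from the example above. Check the vertex numbering against the test genome before relying on it.

```python
from collections import Counter


def edge_histogram(sequence):
    """Count how many times each edge label (base) occurs in the sequence."""
    return Counter(sequence)


def best_segment(sequence, weights):
    """Return (score, begin_vertex, end_vertex) of a highest-scoring segment.

    Assumed convention: vertices are numbered 0..len(sequence), and base k
    is the edge from vertex k to vertex k + 1, so the path itself is
    sequence[begin_vertex:end_vertex].
    """
    best_score, best_begin, best_end = 0.0, 0, 0
    running, begin = 0.0, 0
    for k, base in enumerate(sequence):
        running += weights[base]
        if running <= 0:
            # A non-positive prefix never helps; restart after this base.
            running, begin = 0.0, k + 1
        elif running > best_score:
            best_score, best_begin, best_end = running, begin, k + 1
    return best_score, best_begin, best_end


# Edge weights taken from the example above; the sequence is a toy input.
example_weights = {"A": -1.49, "T": -1.49, "G": 0.74, "C": 0.74}
toy_sequence = "ATGGCA"

toy_histogram = edge_histogram(toy_sequence)
toy_score, toy_begin, toy_end = best_segment(toy_sequence, example_weights)
toy_path = toy_sequence[toy_begin:toy_end]
print(dict(toy_histogram), round(toy_score, 2), toy_begin, toy_end, toy_path)
```

This single left-to-right pass is the linear-graph special case of the general dynamic program: each vertex has exactly one incoming edge, so the only decision at each step is whether to extend the current path or start a new one.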