Genome 540 Homework Assignment 4

Due Sunday Feb. 8

Write a program to find a highest-weight path in a weighted directed acyclic graph. This program should
- read in an input file that contains a representation of the graph in the following form:
  - a list of vertices, with each vertex on a separate line. The line should consist simply of a label (arbitrary character string) for the vertex, which should be unique, i.e. different for different vertices. You may assume that parents precede children in the file, and then use this order in implementing the dynamic programming algorithm.
  - a list of edges, with each edge on a separate line. The line should consist of
    1. a label for the edge (which need not be unique -- i.e. different edges can have the same label);
    2. the label of the edge's beginning vertex;
    3. the label of its ending vertex;
    4. the numerical weight attached to the edge.
- use dynamic programming to find the highest weight path (with additive, not multiplicative, path weights).
- have 2 output options:
  - print the path score, the labels of the beginning and ending vertices on the path, and the labels of the first and last edges in the path; or
  - print the path score, and (in order) the labels for all the edges that occur on this path, one label per line. (You will not need this particular option for the current assignment, but you will need it for a later assignment).
Write a program that
- reads in 2 input files:
  - a FASTA file containing a DNA sequence, and
  - a 'scoring scheme' file that indicates a score to be attached to each possible base (A, C, G, T, or other)
- creates the sequence graph for the sequence, as described in lecture (i.e. a 'linked list' with one edge per residue).
  - the labels attached to successive vertices should be successive integers, represented as character strings (the first vertex being '0', the second being '1', the third being '2', and so on);
  - the label attached to each edge should be the corresponding residue;
  - the weight attached to an edge should be the score for that residue, as specified in the scoring scheme file.
- outputs this graph in a file that has the appropriate input format for the program described in #1 above.
For the same genome sequence you used in HW assignments 1-3:
1. Run the program in #2 above, using as input files the .fna file for that genome and a scoring scheme file that attaches +1 to an A or T nucleotide, and -2 to a G or C (and assigns 0 to any ambiguity characters in the sequence). Then run your program in #1 above on the resulting output file in order to find the highest scoring segment (= path) of the sequence. Use the first output option (i.e. printing path score, and where the path begins and ends, but not the complete list of edges).
2. Same as above, but using the scoring scheme that attaches +1 to a G or C, and -2 to an A or T.
Your final output should include
- the name and first line of the .fna file
- for each of the two scoring schemes, the location (the first and last nucleotide positions, as indicated by the vertex labels), the first and last bases (as indicated by the edge labels), and score of the highest-scoring segment. Do NOT include the entire sequence of the segment, which may be very long.
Email the (final) output from #3 to me AND to Chris. Please make it as compact as possible. Do NOT send the code itself. Include the output in the body of your email message (as plain text), NOT as an attachment.