Genome 540 Homework Assignment 4

(Winter Quarter 2024)

Due Sunday Feb 4, 11:59pm

Write a program to find a highest-weight path in a weighted directed acyclic graph. This program should:
- read in an input file that contains a representation of the graph in the following form:
  - a list of vertices, with each vertex on a separate line. The line should consist of
    1. a line type indicator such as 'V' to specify that the line describes a vertex
    2. a label (arbitrary string of up to 10 characters) for the vertex, which should be unique, i.e. different for different vertices.
    3. the string "START" if the path is to be constrained to start at this vertex
    4. the string "END" if the path is to be constrained to end at this vertex
    At most one vertex should be designated the START and at most one vertex should be designated the END. If none are, the path is assumed to be unconstrained. You may assume that parents precede children in the file, and then use this order in implementing the dynamic programming algorithm.
  - a list of edges, with each edge on a separate line. The line should consist of
    1. a line type indicator such as 'E' to specify that the line describes an edge
    2. a label for the edge (which need not be unique -- i.e. different edges can have the same label);
    3. the label of the edge's beginning vertex;
    4. the label of its ending vertex;
    5. the numerical weight attached to the edge.
- use dynamic programming to find the highest weight path (with additive, not multiplicative, path weights).
- output:
  - the path score
  - the label of the beginning vertex on the path
  - the label of the ending vertex on the path
  - (in order) the labels for all the edges that occur on this path. For example, if the graph is a sequence graph and labels correspond to nucleotides in the sequence, the list of labels would correspond to the sequence of the highest scoring segment of the sequence.
Write a program that:
- reads in 2 input files:
  - a FASTA file containing a DNA sequence, and
  - a 'scoring scheme' file that indicates a score to be attached to each possible base (A, C, G, T, or other)
- creates the sequence graph for the sequence, as described in lecture (i.e. a 'linked list' with one edge per residue).
  - the labels attached to successive vertices should be successive integers, represented as character strings (the first vertex being '0', the second being '1', the third being '2', and so on);
  - the label attached to each edge should be the corresponding residue;
  - the weight attached to an edge should be the score for that residue, as specified in the scoring scheme file.
- outputs this graph in a file that has the appropriate input format for the program described in #1 above (with no vertices designated START or END).
For a graphic depiction of a WDAG:
- Convert (by hand) the graph into an input file having the format described in #1 above.
- Run your program in #1 above on this input file.
For a genome sequence:
- Run the program in #2 above, using as input files the FASTA file for that genomeand a scoring scheme file that attaches -1.49 to an A or T nucleotide, and 0.74 to a G or C (and assigns 0 to any ambiguity characters in the sequence). Then run your program in #1 above on the resulting output file in order to find the highest scoring segment (= path) of the sequence.
Your final output should include:
- the first line of the .fna file
- the nucleotide count histogram
- the output of the program in #1:
  - the score of the highest-scoring segment
  - the location (the first and last nucleotide positions, as deduced from the vertex labels)
  - the sequence (as indicated by the edge labels)
- a description of the segment (look up entry in .gbff/.gff annotation file)
For the homework assignment, you should run your programs on the graph depicted here AND on the genome sequence of S. pyogenes (gbff). For the graph, initially find the maximum weighted path without start/end constraints. Then repeat the analysis constraining the path to start at node 'i' and end at node 'xiii'. Do not use any start/end constraint for the genome sequence analysis.To test whether your program is working correctly, run it first on the test example graph (both without constraint and requiring the path start at node 'vii' and end at node 'i'); and on the test genome sequence of Mycoplasma gallisepticum (gff). Solutions to for the example graph and Mycoplasma gallisepticum genome sequence are provided in the results template below.
You must turn in your results and your computer program, using this file as a template. Please put everything into ONE file - do not send an archive of files or a tar file. After creating a plain text file (NOT a word processing document file) in this format, compress it (using either Unix compress, or gzip -- if you don't have access to either of these programs, let us know), and send it as an attachment to both Phil at phg@uw.edu and Cliff at crostomily@gmail.com