Genome 540 Homework Assignment 3

Due Sunday February 3

  1. Write a program that reads in three input protein sequences (from a FASTA file), together with a score matrix, creates the edit graph for the three sequences, and outputs this graph in a file in the appropriate input format for the WDAG program you wrote in part 1 of homework 2. Recall that, in the edit graph, vertices correspond to triples (i,j,k) where 0 <= i <= n1, 0 <= j <= n2, and 0 <= k <= n3, where n1, n2 and n3 are the lengths of the three sequences. There is an edge from (i,j,k) to (i',j',k') whenever i' = i or i+1, j' = j or j+1, and k' = k or k+1, and at least one of the equalities i'=i, j'=j, and k'=k is false. The label you should attach to an edge is the corresponding column of aligned residues & gap characters -- except that you can write it horizontally instead of vertically. For example, the label associated to the edge from (10,37,5) to (11,37,6) would be V-C if V is the 11th residue in the first sequence and C is the 6th residue in the third sequence. Note that different edges in the graph can have the same label.

    The weights attached to an edge should be the 'sum of pairs' weights. Specifically, each edge is labelled by 3 characters, say x_1, x_2, x_3, which may be residues or gap characters. There are 3 different unordered pairs of these characters, namely (x_1, x_2), (x_1, x_3) and (x_2, x_3). The weight attached to the edge is then the sum of the 3 pair weights, where the weight of (x_i, x_j) is

  2. Run your program to produce an output file giving the edit graph for the following three sequences:
    
    >gi|408363597|gb|AFU57327.1| CRISPR-associated protein, Cas2 family [Candidatus Nitrososphaera gargensis Ga9.2]
    MRYIIVYDITDDNLRTLTSETLKDYGLKRIQKSAFMGSLKKHALNSLLTDLRRMVNVDSVIVFPLCDSDFRNMVSIGKEFVEEEKQDVQFF
    
    >gi|238895124|ref|YP_002919859.1| CRISPR-associated Cas2 family protein [Klebsiella pneumoniae subsp. pneumoniae NTUH-K2044]
    MLVVLANDLPPAVRGRMKLWFVEPRPNVFVSGVKDSVAQTVVDYLLQYTPVESGLMLFRSIPQPPGYEIRYKGEVRKPIIDLSGLQLIIETLKLS
    
    >gi|379012739|ref|YP_005270551.1| CRISPR-associated protein Cas2 [Acetobacterium woodii DSM 1030]
    MYVILVYDICGDEGGQRVLNKTFKICKKYLNHIQNSVFEGEISEPQILNLKYELNKYIRKDIDSVILFKSRNERWLKKEFWANVEDKTDNFL
    
    
    using the BLOSUM62 score matrix for the pairwise scores:
       A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
    A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4
    R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4
    N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4
    D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4
    C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
    Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4
    E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
    G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4
    H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4
    I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1 -4
    L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1 -4
    K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1 -4
    M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1 -4
    F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3 -3 -1 -4
    P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2 -4
    S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0 -4
    T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0 -4
    W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -3 -2 -4
    Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -2 -1 -4
    V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3 -2 -1 -4
    B -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4  1 -1 -4
    Z -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
    X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1 -1 -1 -4
    * -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4  1
    
    Gap penalty: -6
    
    Then run your WDAG program from homework # 2 on this file to produce a highest scoring path (and its score) giving a highest-scoring alignment of the three sequences. You should adjust the program output such that each of the edge labels (corresponding to an aligned column of residues) appears on a separate line.
  3. You must turn in your results and your computer program; use this file as a template. The data definition at the top of the file allows you to validate the format of your results file. To check that you have valid XML syntax you can simply open the XML file with a browser like firefox. To do a more comprehensive validation (i.e. check that you have all of the required elements) use an online XML validator. Wrap your source code between the <file> elements with a CDATA definition (as in the results template) so special characters in the code (e.g. <) can be ignored by the XML parser. Please put everything into ONE plain text file - do not send an archive of files or a tar file, or a word processing document file. Compress it (using either Unix compress, or gzip -- if you don't have access to either of these programs let us know), and send it as an attachment to both Phil (phg (at) u.washington.edu) and Sharon (greensi (at) uw.edu).

    Last year's HW3

    Answer to last year's HW3