Genome 540 Homework Assignment 3

Due Sunday January 31

  1. Write a program that reads in three input protein sequences (from a FASTA file), together with a score matrix, creates the edit graph for the three sequences, and outputs this graph in a file in the appropriate input format for the WDAG program you wrote in part 1 of homework 2. Recall that, in the edit graph, vertices correspond to triples (i,j,k) where 0 <= i <= n1, 0 <= j <= n2, and 0 <= k <= n3, where n1, n2 and n3 are the lengths of the three sequences. There is an edge from (i,j,k) to (i',j',k') whenever i' = i or i+1, j' = j or j+1, and k' = k or k+1, and at least one of the equalities i'=i, j'=j, and k'=k is false. The label you should attach to an edge is the corresponding column of aligned residues & gap characters -- except that you can write it horizontally instead of vertically. For example, the label associated to the edge from (10,37,5) to (11,37,6) would be V-C if V is the 11th residue in the first sequence and C is the 6th residue in the third sequence. Note that different edges in the graph can have the same label.

    The weights attached to an edge should be the 'sum of pairs' weights. Specifically, each edge is labelled by 3 characters, say x_1, x_2, x_3, which may be residues or gap characters. There are 3 different unordered pairs of these characters, namely (x_1, x_2), (x_1, x_3) and (x_2, x_3). The weight attached to the edge is then the sum of the 3 pair weights, where the weight of (x_i, x_j) is

  2. Run your program to produce an output file giving the edit graph for the following three sequences. (Dynein light chain 1 is one of the motor proteins involved in transporting cellular cargo by "walking" along microtubules.)
    
    >sp|P63167|DYL1_HUMAN Dynein light chain 1, cytoplasmic OS=Homo sapiens GN=DYNLL1 PE=1 SV=1
    MCDRKAVIKNADMSEEMQQDSVECATQALEKYNIEKDIAAHIKKEFDKKYNPTWHCIVGRNFGSYVTHETKHFIYFYLGQVAILLFKSG
    
    >sp|Q02647|DYL1_YEAST Dynein light chain 1, cytoplasmic OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) GN=DYN2 PE=1 SV=1
    MSDENKSTPIVKASDITDKLKEDILTISKDALDKYQLERDIAGTVKKQLDVKYGNTWHVIVGKNFGSYVTHEKGHFVYFYIGPLAFLVFKTA
    
    >sp|Q22799|DYL1_CAEEL Dynein light chain 1, cytoplasmic OS=Caenorhabditis elegans GN=dlc-1 PE=1 SV=1
    MVDRKAVIKNADMSDDMQQDAIDCATQALEKYNIEKDIAAYIKKEFDKKYNPTWHCIVGRNFGSYVTHETKHFIYFYLGQVAILLFKSG
    
    
    using the BLOSUM62 score matrix for the pairwise scores:
       A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
    A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4
    R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4
    N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4
    D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4
    C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
    Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4
    E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
    G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4
    H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4
    I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1 -4
    L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1 -4
    K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1 -4
    M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1 -4
    F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3 -3 -1 -4
    P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2 -4
    S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0 -4
    T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0 -4
    W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -3 -2 -4
    Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -2 -1 -4
    V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3 -2 -1 -4
    B -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4  1 -1 -4
    Z -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
    X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1 -1 -1 -4
    * -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4  1
    
    Gap penalty: -6
    
    Then run your WDAG program from homework # 2 [you may have to modify it] on this file to produce a highest scoring path (and its score) giving a highest-scoring alignment of the three sequences. You should adjust the program output such that each of the edge labels (corresponding to an aligned column of residues) appears on a separate line.

    Your program should output the following:

    1. The maximum path score
    2. A list of all edge weights (sorted alphabetically by edge name)
    3. A histogram of edge counts (again, sorted alphabetically by edge name)
    4. The highest-scoring alignment, formatted vertically (as described above)

  3. You must turn in your results and your computer program; use this file as a template. Please put everything into ONE plain text file - do not send an archive of files or a tar file, or a word processing document file. Compress it (using either Unix compress, or gzip -- if you don't have access to either of these programs let us know), and send it as an attachment to both Phil (phg (at) u.washington.edu) and Anne (aclark4 (at) uw.edu).

    HW3 from 2013

    HW3 from 2012

    Answer to 2013 HW3 [in the current format]

    Answer to 2012 HW3 [in the current format]