Genome 540 Homework Assignment 4

Due Sunday, February 5

  1. Write a program that reads in three input protein sequences (from a FASTA file) together with a score matrix, creates the edit graph for the three sequences, and writes this graph to a file in the input format expected by the WDAG program you wrote in part 1 of homework 2. Recall that, in the edit graph, vertices correspond to triples (i,j,k) where 0 <= i <= n1, 0 <= j <= n2, and 0 <= k <= n3, and n1, n2 and n3 are the lengths of the three sequences. There is an edge from (i,j,k) to (i',j',k') whenever i' = i or i+1, j' = j or j+1, and k' = k or k+1, and at least one of the equalities i'=i, j'=j, and k'=k is false (that is, at least one coordinate advances). The label you should attach to an edge is the corresponding column of aligned residues and gap characters, except that you may write it horizontally instead of vertically. For example, the label for the edge from (10,37,5) to (11,37,6) would be V-C if V is the 11th residue in the first sequence and C is the 6th residue in the third sequence. Note that different edges in the graph can have the same label.
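    The edge rule above can be sketched in Python as follows. This is a minimal illustration, not the required implementation: the function name `edit_graph_edges` is hypothetical, the gap character is assumed to be `-` (as in the V-C example), and writing the edges out in your WDAG input format is left to you.

```python
from itertools import product


def edit_graph_edges(s1, s2, s3):
    """Yield (start, end, label) for every edge of the three-sequence edit graph."""
    seqs = (s1, s2, s3)
    # The 7 allowed moves: every 0/1 step vector except (0, 0, 0).
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    for start in product(*(range(len(s) + 1) for s in seqs)):
        for move in moves:
            end = tuple(p + d for p, d in zip(start, move))
            if any(e > len(s) for e, s in zip(end, seqs)):
                continue  # this step would run past the end of a sequence
            # A coordinate that advances from p to p+1 consumes the (p+1)-th
            # residue, which is s[p] with 0-based indexing; otherwise use a gap.
            label = "".join(s[p] if d else "-" for s, p, d in zip(seqs, start, move))
            yield start, end, label


if __name__ == "__main__":
    for start, end, label in edit_graph_edges("AB", "C", "D"):
        print(start, end, label)
```

    For sequences of lengths n1, n2 and n3 the graph has (n1+1)(n2+1)(n3+1) vertices and up to 7 outgoing edges per vertex, so a generator that streams edges to the output file avoids holding the whole graph in memory.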

    The weights attached to an edge should be the 'sum of pairs' weights. Specifically, each edge is labelled by 3 characters, say x_1, x_2, x_3, which may be residues or gap characters. There are 3 different unordered pairs of these characters, namely (x_1, x_2), (x_1, x_3) and (x_2, x_3). The weight attached to the edge is then the sum of the 3 pair weights, where the weight of (x_i, x_j) is

  2. Run your program to produce an output file giving the edit graph for the following three sequences.
    
    >tr|J3QS45|J3QS45_HUMAN L-xylulose reductase (Fragment) OS=Homo sapiens GN=DCXR PE=1 SV=1
    XLVNNAAVALLQPFLEVTKEAFDRSFEVNLRAVIQVSQIVARGLIARGVPGAIVNVSSQC
    SQRAVTNHSVYCSTKGALDMLTKVMALELGPHKLSRCRSE
    
    >tr|A0A164NGQ8|A0A164NGQ8_9CRUS L-xylulose reductase OS=Daphnia magna GN=APZ42_030734 PE=4 SV=1
    MIDSKTGGSIVHVSSQASQAPLKDHAVYCSAKAALDMLCKVMALELGQHKIRVNCVNPTV
    VLTEMGKLGWSDPTKANPMLAGIPLGRFAEVEEVVDAVV
    
    >tr|A0A1D2MGF9|A0A1D2MGF9_ORCCI L-xylulose reductase OS=Orchesella cincta GN=Ocin01_14604 PE=4 SV=1
    MSQNYPICADLTDLEQTQKAVKPIEAVDCLINNAGIMELADFLGISPEHVERQFSVNLKA
    VVFLSQIIVKKMIAAGNGGSILNMSSILGQRPVKFCGIYNCTKAALTC
    
    
    using the BLOSUM62 score matrix for the pairwise scores:
       A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
    A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4
    R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4
    N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4
    D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4
    C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
    Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4
    E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
    G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4
    H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4
    I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1 -4
    L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1 -4
    K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1 -4
    M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1 -4
    F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3 -3 -1 -4
    P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2 -4
    S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0 -4
    T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0 -4
    W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -3 -2 -4
    Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -2 -1 -4
    V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3 -2 -1 -4
    B -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4  1 -1 -4
    Z -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
    X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1 -1 -1 -4
    * -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4  1
    
    Gap penalty: -6
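
    A sum-of-pairs weight using this matrix and gap penalty could be computed along the following lines. This is a sketch under stated assumptions: a residue-residue pair scores its matrix entry, a residue-gap pair scores the gap penalty, and a gap-gap pair scores 0. The gap-gap value in particular is an assumption of this example, so check it against the assignment's pair-weight rule. The names `parse_matrix`, `sum_of_pairs` and `EXCERPT` are hypothetical, and `EXCERPT` is only the A/R/N corner of the table above, used to keep the example short.

```python
from itertools import combinations

# Top-left corner of the BLOSUM62 table above (illustration only).
EXCERPT = """
   A  R  N
A  4 -1 -2
R -1  5  0
N -2  0  6
"""


def parse_matrix(text):
    """Parse a whitespace-delimited score matrix into {(row, col): score}."""
    rows = [line.split() for line in text.strip().splitlines()]
    columns = rows[0]
    scores = {}
    for row in rows[1:]:
        for column, value in zip(columns, row[1:]):
            scores[(row[0], column)] = int(value)
    return scores


def sum_of_pairs(label, scores, gap_penalty=-6, gap="-"):
    """Sum the pair weights over the 3 unordered pairs of a 3-character label."""
    total = 0
    for a, b in combinations(label, 2):
        if a == gap and b == gap:
            total += 0  # ASSUMPTION: a gap aligned with a gap contributes nothing
        elif a == gap or b == gap:
            total += gap_penalty
        else:
            total += scores[(a, b)]
    return total


if __name__ == "__main__":
    scores = parse_matrix(EXCERPT)
    print(sum_of_pairs("A-R", scores))  # (A,-) + (A,R) + (-,R)
```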
    
    Then run your WDAG program from homework 2 (you may have to modify it) on this file to find a highest-scoring path and its score, which gives a highest-scoring alignment of the three sequences. Adjust the program's output so that each edge label (corresponding to one aligned column of residues) appears on a separate line. You could copy the code and modify the print function or, if you want some general coding practice, add a command-line option that lets your original program print edge labels either horizontally or vertically (e.g. '--vertical', though probably not '-v', because that is usually reserved for verbose output).
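
    One way to add such an option is with Python's standard `argparse` module. The sketch below is only an illustration: the positional argument `graph_file` and the helper names `build_parser` and `format_path` are hypothetical, and horizontal output is assumed to mean labels separated by spaces on one line.

```python
import argparse


def format_path(labels, vertical=False):
    """Join edge labels one per line (vertical) or space-separated on one line."""
    separator = "\n" if vertical else " "
    return separator.join(labels)


def build_parser():
    parser = argparse.ArgumentParser(description="Highest-weight path in a WDAG")
    parser.add_argument("graph_file", help="graph file in your homework 2 input format")
    parser.add_argument(
        "--vertical",
        action="store_true",
        help="print each edge label on its own line",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    # Placeholder labels; a real program would take them from the best path.
    print(format_path(["V-C", "AAR"], vertical=args.vertical))
```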

    Your program should output the following:

    1. The maximum path score
    2. A list of all edge weights (sorted alphabetically by edge name)
    3. A histogram of edge counts (again, sorted alphabetically by edge name)
    4. The highest-scoring alignment, formatted vertically (as described above)
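
    The four items could be assembled roughly as follows. This sketch assumes that the edge weights and the histogram are reported for the edges on the highest-scoring path, keyed by edge label; the section headings, the `label = value` layout, and the function name `build_report` are hypothetical, so follow the template file for the exact format.

```python
from collections import Counter


def build_report(score, path):
    """Build the report text from the best score and a list of (label, weight) edges."""
    labels = [label for label, _ in path]
    weights = dict(path)  # a given label always carries the same weight
    counts = Counter(labels)
    lines = [f"Score: {score}", "Edge weights:"]
    lines += [f"{label} = {weights[label]}" for label in sorted(weights)]
    lines.append("Edge histogram:")
    lines += [f"{label} = {counts[label]}" for label in sorted(counts)]
    lines.append("Alignment:")
    lines += labels  # one aligned column per line, in path order
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical three-edge path, for illustration only.
    print(build_report(11, [("AAA", 12), ("A-R", -13), ("AAA", 12)]))
```

    Note that Python's default string sort orders by code point, so the gap character `-` sorts before every letter.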

  3. You must turn in your results and your computer program; use this file as a template. Please put everything into ONE plain text file: do not send an archive of files, a tar file, or a word-processing document. Compress it (using either Unix compress or gzip; if you don't have access to either of these programs, let us know), and send it as an attachment to both Phil (phg (at) u.washington.edu) and Alex (engal (at) uw.edu).

    HW4 from 2013

    HW4 from 2012

    Answer to 2013 HW4 [in the current format]

    Answer to 2012 HW4 [in the current format]