Genome 540 Homework Assignment 3
Due Sunday Feb. 1
- For the same bacterium you used in assignments 1 and 2, write a program that implements the method described in
Lecture 6 to find the longest perfectly-repeated subsequence in the
genome (i.e. the longest subsequence occurring at least twice, with no
differences), indicating the location of each copy. If there are
several different perfectly-repeated sequences of the same length,
find all of them. When you are looking for these, consider
both strands of the genome (i.e. both the sequence given in the
.fna file and its reverse complement) simultaneously, so that if the
repeated sequence occurs only once on each strand you will still find
it. The easiest way to consider both strands simultaneously is to
create a single "merged" sequence of length 2N (if N is the size of
the genome in bp) which appends the bottom strand sequence to the end of the top
strand sequence, and then create a single list of pointers to each
position in this merged sequence and sort that list. Your output should provide
- the name and first line of the .fna file
- a list of the repeated sequences, giving in each case the first 10 bases
and last 10 bases of the sequence itself separated by "..." (for example,
ATATAAGTCA...CGAGAACATG), its length, and each location where
it occurs (the first and last nucleotide positions, and which strand
it is on).
- Email this to me AND to Chris. Please make it as compact
as possible. Although the repeated sequence might be long, the list itself
should not be: if there are a huge number of copies of one repeat, or a
large number of different repeated sequences all of exactly the same
length, you are probably doing something wrong and should check with me or
Chris before sending your output. Do NOT send the code itself. Include the
output in the body of your email message (as plain text), NOT as an
attachment.
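
The merged-sequence idea and the output format described above can be illustrated with a minimal Python sketch. It is not the Lecture 6 method itself, only one possible reading of it, and it rests on several assumptions: the sequence is uppercase A/C/G/T only; each suffix is cut off at the end of its own strand so that no match runs across the junction between the two strands; and bottom-strand locations are reported as 1-based top-strand coordinates, a convention chosen here because the assignment does not fix one. The function names (reverse_complement, longest_repeats, describe) are hypothetical. Sorting with sliced strings as keys copies every suffix, so this is only practical for toy inputs; a real genome needs pointers compared in place.

```python
def reverse_complement(seq):
    """Reverse complement of an uppercase ACGT string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def common_prefix_len(a, b):
    """Number of leading characters shared by a and b."""
    k = 0
    for x, y in zip(a, b):
        if x != y:
            break
        k += 1
    return k


def longest_repeats(top):
    """Return (length, [(sequence, sorted merged start indices), ...])."""
    n = len(top)
    merged = top + reverse_complement(top)  # merged sequence of length 2N

    def suffix(i):
        # Assumption: stop each suffix at the end of its own strand so that
        # no match can run across the top/bottom junction.
        return merged[i:n] if i < n else merged[i:]

    # "Pointers" are start indices into merged, sorted by the suffix they name.
    order = sorted(range(2 * n), key=suffix)
    # After sorting, the longest shared prefix is found between neighbours.
    lcps = [common_prefix_len(suffix(a), suffix(b))
            for a, b in zip(order, order[1:])]
    best = max(lcps, default=0)
    if best == 0:
        return 0, []
    groups, in_run = [], False
    for k, lcp in enumerate(lcps):
        if lcp == best:
            if not in_run:
                groups.append([order[k]])  # a new repeated sequence starts
            groups[-1].append(order[k + 1])
        in_run = lcp == best
    return best, [(merged[g[0]:g[0] + best], sorted(g)) for g in groups]


def describe(seq, start, n):
    """One output line for a copy starting at merged index `start`."""
    length = len(seq)
    label = seq if length <= 20 else seq[:10] + "..." + seq[-10:]
    if start < n:
        strand, first = "top", start + 1
    else:
        # Assumed convention: map the bottom-strand copy back to 1-based
        # top-strand coordinates, reported low to high.
        strand, first = "bottom", 2 * n - start - length + 1
    return f"{label} len={length} {strand} {first}-{first + length - 1}"
```

To use it, call longest_repeats on the genome string and pass each returned sequence and start index, together with the genome length N, to describe. Each maximal run of equal neighbouring prefix lengths in the sorted list corresponds to one repeated sequence, so every copy of that sequence ends up in the same group.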