Genome 540 Homework Assignment 3

Due Monday Feb. 3

  1. For the same bacterium you used in assignments 1 and 2, write a program that implements the method described in Lecture 7 to find the longest perfectly repeated subsequence in the genome (i.e., the longest subsequence that occurs at least twice with no differences), and report the location of each copy. If several different perfectly repeated sequences share that maximum length, find all of them. Consider both strands of the genome simultaneously (i.e., both the sequence given in the .fna file and its reverse complement), so that you will still find a repeated sequence that occurs only once on each strand. The easiest way to do this is to create a single "merged" sequence of length 2N (where N is the size of the genome in bp) by appending the bottom-strand sequence to the end of the top-strand sequence, then create a single list of pointers, one to each position in the merged sequence, and sort that list. Your output should provide
  2. Email this to me. Please make it as compact as possible. Although the repeated sequence might be long, the list itself should not be -- if there are a huge number of copies of one repeat, or a large number of different repeated sequences all of exactly the same length, you are probably doing something wrong and should check with me before sending your output. Do NOT send the code itself. Include the output in the body of your email message (as plain text), NOT as an attachment.
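
The merged-sequence and sorted-pointer idea from item 1 can be illustrated on a toy string. The sketch below is in Python and rests on several assumptions that are not part of the assignment: the function names (reverse_complement, common_prefix_length, longest_repeats, to_strand_coords) are hypothetical, positions are 0-based, the input contains only A, C, G and T, and the sort is a naive one that copies every suffix, which is fine for a short example but far too slow and memory-hungry for a real genome. It does not treat matches that run across the junction between the two strands specially, and it is not claimed to be the exact procedure from Lecture 7.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def common_prefix_length(text, i, j):
    """Length of the longest common prefix of text[i:] and text[j:]."""
    n = len(text)
    k = 0
    while i + k < n and j + k < n and text[i + k] == text[j + k]:
        k += 1
    return k


def longest_repeats(genome):
    """Return (length, {repeat: sorted start positions in the merged sequence})."""
    # Merged sequence of length 2N: top strand followed by bottom strand.
    merged = genome + reverse_complement(genome)
    # One pointer per position, sorted by the suffix starting there.
    # Naive: the key copies each suffix, so this only suits toy inputs.
    pointers = sorted(range(len(merged)), key=lambda p: merged[p:])
    best = 0
    hits = {}
    # All copies of a repeat sit next to each other in sorted order,
    # so comparing adjacent pointers is enough.
    for a, b in zip(pointers, pointers[1:]):
        k = common_prefix_length(merged, a, b)
        if k == 0 or k < best:
            continue
        if k > best:
            best, hits = k, {}  # longer repeat found: discard shorter ones
        hits.setdefault(merged[a:a + k], set()).update((a, b))
    return best, {seq: sorted(pos) for seq, pos in hits.items()}


def to_strand_coords(pos, length, n):
    """Map a merged-sequence start to (strand, 0-based top-strand start)."""
    if pos < n:
        return "+", pos
    # A bottom-strand match covers top-strand positions
    # 2n - pos - length .. 2n - pos - 1.
    return "-", 2 * n - pos - length


if __name__ == "__main__":
    length, repeats = longest_repeats("ACCAGGT")
    print(length, repeats)
```

For the toy input "ACCAGGT" the merged sequence is "ACCAGGTACCTGGT", and the sketch reports length 3 with "ACC" at merged positions 0 and 7 and "GGT" at positions 4 and 11. These two entries are the same repeat seen from each strand: to_strand_coords(7, 3, 7) gives ("-", 4), meaning the second copy of "ACC" lies on the bottom strand opposite top-strand positions 4 to 6. A real solution would need to collapse such mirror-image pairs and use a sort that compares suffixes in place.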