Genome 540 Homework Assignment 1

(Winter Quarter 2022)

Due Sunday Jan. 16, 11:59 pm

Late homework policy is described on course web page.

Download and begin reading Initial sequencing and analysis of the human genome. The GenomeInternational Sequencing Consortium. Nature 409, 860-921 (15 February 2001) . (To print this out, I would recommend the pdf format which corresponds exactly to the printed version, rather than the html format.) For next week, read:
- Introduction and background (pp. 860-863, up to but not including "Strategic issues")
- The section "Broad genomic landscape" (pp. 875-879, up to but not including "Repeat content of the human genome")
- The section "Gene content of the human genome" (starting p. 892) up to but not including "Comparative proteome analysis" (p. 901).
Write a program that implements the method described in Lecture 2 to find, for each suffix in the 'forward' strand of sequence 1, the length of the longest matching subsequence in sequence 2 (or its reverse complement). It should report a histogram of these lengths, and also the longest exactly matching subsequence between two sequences. Specifically, your program should:
- Read two input files in "FASTA" format (i.e. having a header line which starts with the character ">" and includes the sequence name, with the sequence itself following on subsequent lines), each of which contains one of the sequences to be compared. (Note that some nucleotides may be lower-case; when storing the sequence in memory, be sure to convert these to upper case, so that the lexicographic sort will handle them correctly). As a data check, your program should count the number of bases of each type (i.e. A, C, T, G, or N) in each sequence.
- Use the described algorithm to find the longest matching subsequence in sequence 2 for each suffix in sequence 1. When you are looking for these, you need to allow for the fact that matching subsequences may occur on opposite strands in the two sequences. You must consider both strands of sequence 2 (i.e. both the sequence given in the FASTA file, which is sometimes called the 'forward' strand, and its reverse complement, which is sometimes called the 'reverse' strand), but you only need to consider the forward strand of sequence 1. The easiest way to do this is to create and store in memory a single sequence of length N + 2M + 3 (where the two input sequences have lengths N and M bases respectively) constructed by concatenating together sequence 1's forward strand and sequence 2's forward and reverse strands (each terminated by a null character to terminate the strings, so that the lexicographic sort does not read across the sequence boundaries), and then create a single list of pointers to each position in this merged sequence, which you then sort. For each suffix in the forward strand of sequence 1, find the longest matching subsequence in the forward or reverse strand of sequence 2.
- Construct a histogram which indicates, for each length, the number of sequence 1 suffices having that match length to sequence 2. The histogram template can be seen in the template file below. If you sum the counts over all lengths, the total should equal the length of sequence 1. When reporting the histogram, eliminate lengths that have counts of 0.
- Also report the longest subsequence present in both sequences. If there are several different perfectly repeated subsequences of the same maximum length, find all of them. If a longest subsequence is present multiple times in either sequence, please report all locations of the subsequence.
Using this program, you should then find the match length histogram and longest exactly matching sequences in orthologous 10-megabase regions in the human and mouse genomes. Several interesting genes are located here. For example, goosecoid (GSC) is a homeobox gene which plays a crucial role in the phenomenon of Spemann's Organizer (Ref). DICER1 is involved in the maturation of several classes of small non-coding RNAs, such as microRNAs (Ref). Once you find the longest match(es) using your program, you should then try to figure out what biological feature they correspond to. You can do this by exploring the UCSC browser (hg38 and mm10). Hint: see this Science paper on "ultraconserved elements" (UCEs; >=200 bp exact match between mouse and human). Extra credit (1 point): based on your histogram, how many UCEs are there in this region? Justify your answer.
To test whether your program is working correctly, run it first on the test example indicated below (with two different bacterial genomes) to see whether you get the right answer. The FASTA files for the test examples can be found here and here. The General feature format files are here and here. In general, FASTA files for these and other bacterial genomes can be found by going to the NCBI web site and following appropriate links. (On the NCBI site, the FASTA files containing the full genome sequences have the suffix .fna. To find the biological features, look in the 'General feature format' or 'Genbank format' files for the organisms (which have the suffix .gff and .gbk on the NCBI site) and find the annotated 'feature' in each genome that overlaps the matching segment you found. You don't need to write a program to do this -- you can just read the .gff or .gbk file on the website.) .gff files are also searchable by 'bedtools intersect -a {your.gff} -b {region_of_interest.bed}'.
You must turn in your results and your computer program, using this file as a template. Please put everything into ONE file - do not send an archive of files or a tar file. After creating a plain text file (NOT a word processing document file) in this format, compress it (using either Unix compress, or gzip -- if you don't have access to either of these programs, let us know), and send it as an attachment to both Phil at phg@uw.edu and CX at cxqiu@uw.edu.

Details:

Fasta: put the name of the fasta file, along with the first line.

Nucleotide histogram: count the total number of bases ("*") and the number of times each specific base occurs. i.e.:

Match Length Histogram: report a histogram which indicates, for each length, the number of sequence 1 suffices having that match length to sequence 2. i.e.:

  Match Length Histogram:
  1 match count
  2 match count
  3 match count
  .
  .
  list all the match length counts
  (only first/last 3 shown here)
  .
  .
  20 match count
  21 match count
  22 match count

Extra credit question: Based on your histogram, report the number of UCEs in this region and justify your answer.

Description (for the longest match): Put a short identification of the shared DNA subsequence, for instance: "This DNA sequence is the first 20 bases of an RNA polymerase gene."

Position (for the longest match): Put the location of the matching subsequence within the input sequence here (rather than the absolute location in the given genome). The location should be the index of the DNA base in the sequence that is closest to the beginning of the forward strand. Use a coordinate system starting at 1 rather than 0. For example, if the two chromosomal strands are:

  			 5'-ACTGA-3'
  			 3'-TGACT-5'

and you found the subsequence TCA on the reverse strand to be the longest match to the other sequence, then the location should be reported as 3. If instead you found CTG on the forward strand, then the location should be reported as 2.