Genome 540 Homework Assignment 1

(Winter Quarter 2023)

Due Sunday Jan. 15, 11:59 pm

Late homework policy is described on course web page.

Download and begin reading Initial sequencing and analysis of the human genome. The GenomeInternational Sequencing Consortium. Nature 409, 860-921 (15 February 2001) . (To print this out, I would recommend the pdf format which corresponds exactly to the printed version, rather than the html format.) For next week, read:
- Introduction and background (pp. 860-863, up to but not including "Strategic issues")
- The section "Broad genomic landscape" (pp. 875-879, up to but not including "Repeat content of the human genome")
- The section "Gene content of the human genome" (starting p. 892) up to but not including "Comparative proteome analysis" (p. 901).
Write a program that implements the method described in Lecture 2 to find, for each suffix in the 'forward' strand of sequence 1, the length of the longest matching subsequence in sequence 2 (or its reverse complement). It should report a histogram of these lengths, and also the longest exactly matching subsequence between two sequences. Specifically, your program should:
- Read two input files in "FASTA" format (i.e. having a header line which starts with the character ">" and includes the sequence name, with the sequence itself following on subsequent lines), each of which contains one of the sequences to be compared. You can estimate the amount of space required in memory from the file size, allocate it, and then read in the files. Note that in each fasta file, sequence position numbers have been added to the start of only some of the lines. Each of these lines starts with a number, a space, and then that line's sequence. When loading the sequence you will need to handle this accordingly, and additionally report the number of non-alphabetic characters that were found (exclude header line characters from this count). Note also that some nucleotides may be lower-case; when storing the sequence in memory, be sure to convert these to upper case, so that the lexicographic sort will handle them correctly. As a data check, your program should count the number of bases of each type (i.e. A, C, T, G, or N) in each sequence.
- Use the described algorithm to find the longest matching subsequence in sequence 2 for each suffix in sequence 1. When you are looking for these, you need to allow for the fact that matching subsequences may occur on opposite strands in the two sequences. You must consider both strands of sequence 2 (i.e. both the sequence given in the FASTA file, which is sometimes called the 'forward' strand, and its reverse complement, which is sometimes called the 'reverse' strand), but you only need to consider the forward strand of sequence 1. The easiest way to do this is to create and store in memory a single sequence of length N + 2M + 3 (where the two input sequences have lengths N and M bases respectively) constructed by concatenating together sequence 1's forward strand and sequence 2's forward and reverse strands (each terminated by a null character to terminate the strings, so that the lexicographic sort does not read across the sequence boundaries), and then create a single list of pointers to each position in this merged sequence, which you then sort. For each suffix in the forward strand of sequence 1, find the longest matching subsequence in the forward or reverse strand of sequence 2.
- Construct a histogram which indicates, for each length, the number of sequence 1 suffices having that match length to sequence 2. The histogram template can be seen in the template file below. If you sum the counts over all lengths, the total should equal the length of sequence 1. When reporting the histogram, eliminate lengths that have counts of 0.
- Also report the longest subsequence present in both sequences. If there are several different perfectly repeated subsequences of the same maximum length, find all of them. If a longest subsequence is present multiple times in either sequence, please report all locations of the subsequence.
Using this program, you should then find the match length histogram and longest exactly matching sequences in orthologous 10-megabase regions in the human and mouse genomes.
To test whether your program is working correctly, run it first on the test example indicated below (with two different bacterial genomes) to see whether you get the right answer. The FASTA files for the test examples can be found here and here. The General feature format files are here and here. In general, FASTA files for these and other bacterial genomes can be found by going to the NCBI web site and following appropriate links. (On the NCBI site, the FASTA files containing the full genome sequences have the suffix .fna. To find the biological features, look in the 'General feature format' or 'Genbank format' files for the organisms (which have the suffix .gff and .gbk on the NCBI site) and find the annotated 'feature' in each genome that overlaps the matching segment you found. You don't need to write a program to do this -- you can just read the .gff or .gbk file on the website.) .gff files are also searchable by 'bedtools intersect -a {your.gff} -b {region_of_interest.bed}'.
You must turn in your results and your computer program, using this file as a template. Please put everything into ONE file - do not send an archive of files or a tar file. After creating a plain text file (NOT a word processing document file) in this format, compress it (using either Unix compress, or gzip -- if you don't have access to either of these programs, let us know), and send it as an attachment to both Phil at phg@uw.edu and Conor at concamp@uw.edu.

Details:

Fasta: put the name of the fasta file, along with the first line.

Extraneous characters: put the non-alphabetic character count for this fasta file (excluding the header line).

Nucleotide histogram: count the total number of bases ("*") and the number of times each specific base occurs. i.e.:

Match Length Histogram: report a histogram which indicates, for each length, the number of sequence 1 suffices having that match length to sequence 2. i.e.:

  Match Length Histogram:
  1 match count
  2 match count
  3 match count
  .
  .
  list all the match length counts
  (only first/last 3 shown here)
  .
  .
  20 match count
  21 match count
  22 match count

Description (for the longest match): Put a short identification of the shared DNA subsequence, for instance: "This DNA sequence is the first 20 bases of an RNA polymerase gene."

Position (for the longest match): Put the location of the matching subsequence within the input sequence here (rather than the absolute location in the given genome). The location should be the index of the DNA base in the sequence that is closest to the beginning of the forward strand. Use a coordinate system starting at 1 rather than 0. For example, if the two chromosomal strands are:

  			 5'-ACTGA-3'
  			 3'-TGACT-5'

and you found the subsequence TCA on the reverse strand to be the longest match to the other sequence, then the location should be reported as 3. If instead you found CTG on the forward strand, then the location should be reported as 2.