Genome 540 Homework Assignment 1

(Winter Quarter 2023)

Due Sunday Jan. 15, 11:59 pm

Late homework policy is described on course web page.

  1. Download and begin reading Initial sequencing and analysis of the human genome. The Genome International Sequencing Consortium. Nature 409, 860-921 (15 February 2001). (To print this out, I recommend the PDF format, which corresponds exactly to the printed version, rather than the HTML format.) For next week, read:
  2. Write a program that implements the method described in Lecture 2 to find, for each suffix in the 'forward' strand of sequence 1, the length of the longest matching subsequence in sequence 2 (or its reverse complement). It should report a histogram of these lengths, and also the longest exactly matching subsequence between the two sequences. Specifically, your program should:
  3. Using this program, you should then find the match length histogram and longest exactly matching sequences in orthologous 10-megabase regions in the human and mouse genomes.
  4. To test whether your program is working correctly, run it first on the test example indicated below (with two different bacterial genomes) and check that you get the right answer. The FASTA files for the test examples can be found here and here. The General feature format files are here and here. In general, FASTA files for these and other bacterial genomes can be found by going to the NCBI web site and following appropriate links. (On the NCBI site, the FASTA files containing the full genome sequences have the suffix .fna.) To find the biological features, look in the 'General feature format' or 'Genbank format' files for the organisms (which have the suffixes .gff and .gbk on the NCBI site) and find the annotated 'feature' in each genome that overlaps the matching segment you found. You don't need to write a program to do this; you can just read the .gff or .gbk file on the website. .gff files are also searchable with 'bedtools intersect -a {your.gff} -b {region_of_interest.bed}'.
  5. You must turn in your results and your computer program, using this file as a template. Please put everything into ONE file - do not send an archive of files or a tar file. After creating a plain text file (NOT a word processing document file) in this format, compress it (using either Unix compress, or gzip -- if you don't have access to either of these programs, let us know), and send it as an attachment to both Phil at phg@uw.edu and Conor at concamp@uw.edu.
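Step 4 does not require a program for the feature lookup, but if you prefer one, the following is a minimal sketch of scanning a .gff file for features that overlap a match. The function name overlapping_features and its arguments are hypothetical. It assumes a tab-separated GFF file with nine columns and 1-based inclusive coordinates, and it ignores the sequence ID column, so it is only suitable when the file describes a single sequence.

```python
def overlapping_features(gff_lines, start, end):
    """Return GFF features overlapping the 1-based inclusive range [start, end].

    Hypothetical helper: it ignores the seqid column, so it assumes that
    every feature line refers to the same sequence as the match.
    """
    hits = []
    for line in gff_lines:
        # Skip blank lines and comment/directive lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a feature line (e.g. embedded sequence data)
        f_start, f_end = int(fields[3]), int(fields[4])
        # Two closed intervals overlap when each starts before the other ends.
        if f_start <= end and start <= f_end:
            # Keep type, start, end, strand, and the attributes column.
            hits.append((fields[2], f_start, f_end, fields[6], fields[8]))
    return hits
```

Passing an open file handle as gff_lines works because the function only iterates over lines.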

Details:

Fasta: put the name of the fasta file, along with the first line.

Extraneous characters: put the non-alphabetic character count for this fasta file (excluding the header line).

Nucleotide histogram: count the total number of bases ("*") and the number of times each specific base occurs. For example:

  *=1012800
  A=349623
  C=159237
  G=159490
  T=344450
  N=0
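The header line, extraneous character count, and nucleotide histogram can all be collected in one pass over the file. The sketch below is one possible way to do it; the names fasta_stats and format_nucleotide_histogram are hypothetical. It makes three assumptions that the assignment does not state: line-ending characters are not counted as extraneous, lowercase letters are folded to uppercase, and every alphabetic character contributes to the "*" total.

```python
from collections import Counter


def fasta_stats(lines):
    """Return (header, non_alpha_count, base_counts) for a single-record FASTA.

    Assumptions (not from the assignment): newlines are not counted as
    extraneous characters, and lowercase bases are folded to uppercase.
    """
    header = None
    non_alpha = 0
    counts = Counter()
    for line in lines:
        line = line.rstrip("\r\n")
        if header is None and line.startswith(">"):
            header = line  # first line of the file, excluded from all counts
            continue
        for ch in line:
            if ch.isalpha():
                counts[ch.upper()] += 1
            else:
                non_alpha += 1
    return header, non_alpha, counts


def format_nucleotide_histogram(counts):
    """Format counts as '*=total' followed by one 'BASE=count' row per base."""
    rows = ["*=%d" % sum(counts.values())]
    for base in "ACGTN":
        rows.append("%s=%d" % (base, counts[base]))
    # Any other letters (e.g. ambiguity codes) are listed after the standard ones.
    for base in sorted(set(counts) - set("ACGTN")):
        rows.append("%s=%d" % (base, counts[base]))
    return rows
```

For a real genome, pass an open file handle as lines so the sequence is streamed rather than loaded at once.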
  

Match Length Histogram: report a histogram which indicates, for each length, the number of sequence 1 suffixes having that match length to sequence 2. For example:

  Match Length Histogram:
  1 match count
  2 match count
  3 match count
  .
  .
  list all the match length counts
  (only first/last 3 shown here)
  .
  .
  20 match count
  21 match count
  22 match count
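When debugging, it can help to compare your output against a brute-force reference on very short sequences. The sketch below is not the Lecture 2 method and is far too slow for genome-sized input; it only illustrates what the histogram counts. The function names are hypothetical, and it assumes uppercase sequences containing only A, C, G, and T.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase ACGT string."""
    return seq.translate(COMPLEMENT)[::-1]


def naive_match_lengths(seq1, seq2):
    """For each suffix of seq1, the length of its longest prefix that occurs
    anywhere in seq2 or in the reverse complement of seq2.

    Brute force, for checking small test cases only.
    """
    targets = (seq2, reverse_complement(seq2))
    lengths = []
    for i in range(len(seq1)):
        k = 0
        # Extend the prefix one base at a time while it still occurs somewhere.
        while i + k < len(seq1) and any(seq1[i:i + k + 1] in t for t in targets):
            k += 1
        lengths.append(k)
    return lengths


def match_length_histogram(lengths):
    """One 'length count' row per observed match length, in increasing order."""
    hist = Counter(lengths)
    return ["%d %d" % (length, hist[length]) for length in sorted(hist)]
```

Note that a suffix with no match at all would produce a row for length 0; whether to report that row is a choice the sketch leaves open.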
  

Description (for the longest match): Put a short identification of the shared DNA subsequence, for instance: "This DNA sequence is the first 20 bases of an RNA polymerase gene."

Position (for the longest match): Put the location of the matching subsequence within the input sequence here (rather than the absolute location in the given genome). The location should be the index of the DNA base in the sequence that is closest to the beginning of the forward strand. Use a coordinate system starting at 1 rather than 0. For example, if the two chromosomal strands are:

  			 5'-ACTGA-3'
  			 3'-TGACT-5'
  
and you found the subsequence TCA on the reverse strand to be the longest match to the other sequence, then the location should be reported as 3. If instead you found CTG on the forward strand, then the location should be reported as 2.
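If your program searches the reverse complement as a separate string, the offset it finds has to be converted back to forward-strand coordinates. The following is a small sketch of that conversion; the function name and parameters are hypothetical, and it assumes the match offset is 0-based within whichever string was searched.

```python
def forward_position(start0, length, seq_len, on_reverse):
    """Convert a 0-based match offset to the 1-based forward-strand position.

    start0     -- 0-based offset of the match in the string that was searched
                  (the forward sequence, or its reverse complement)
    length     -- length of the match
    seq_len    -- length of the full input sequence
    on_reverse -- True if the match was found in the reverse complement
    """
    if on_reverse:
        # The last base of the match on the reverse complement pairs with
        # the forward-strand base closest to the start of the forward strand.
        return seq_len - (start0 + length) + 1
    return start0 + 1


# The example above: the forward strand is ACTGA, so its reverse complement is TCAGT.
print(forward_position("TCAGT".find("TCA"), 3, 5, on_reverse=True))   # 3
print(forward_position("ACTGA".find("CTG"), 3, 5, on_reverse=False))  # 2
```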