Genome 540 Homework Assignment 2

(Winter Quarter 2023)

Due Sunday Jan 22, 11:59pm

  1. Write a program that generates the FASTA files of two simulated sequences. This program should:
  2. The sequence names for your output sequences should indicate that they are simulated (and the model used). Note that the nucleotide and dinucleotide counts in your simulated sequences may differ slightly from those in the input, due to the randomization process.
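
As an illustration of the three models named later in this assignment (equal frequency, order-0 Markov, order-1 Markov), the following is a minimal Python sketch, not a reference solution. It assumes the input sequence has already been read from the FASTA file and reduced to uppercase A/C/G/T; the function names, the 60-column line width, and the fallback used when a base never has a successor in the input are all assumptions made for this example.

```python
import random
from collections import Counter

ALPHABET = "ACGT"  # assumption: input contains only uppercase A/C/G/T


def simulate_equal(length, rng):
    """Draw each base independently with probability 1/4."""
    return "".join(rng.choices(ALPHABET, k=length))


def simulate_order0(seq, length, rng):
    """Draw each base independently, weighted by the input's nucleotide counts."""
    counts = Counter(seq)
    weights = [counts[b] for b in ALPHABET]
    return "".join(rng.choices(ALPHABET, weights=weights, k=length))


def simulate_order1(seq, length, rng):
    """Draw each base conditional on the previous one, using dinucleotide counts."""
    if length <= 0:
        return ""
    base_counts = Counter(seq)
    pair_counts = Counter(zip(seq, seq[1:]))
    base_weights = [base_counts[b] for b in ALPHABET]
    rows = {}
    for prev in ALPHABET:
        row = [pair_counts[(prev, nxt)] for nxt in ALPHABET]
        # Assumption: fall back to order-0 weights if `prev` has no successor.
        rows[prev] = row if sum(row) > 0 else base_weights
    # The first base has no predecessor, so it is drawn from the order-0 weights.
    out = rng.choices(ALPHABET, weights=base_weights)
    while len(out) < length:
        out.append(rng.choices(ALPHABET, weights=rows[out[-1]])[0])
    return "".join(out)


def format_fasta(name, seq, width=60):
    """Return a FASTA record with the sequence wrapped at `width` columns."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    rng = random.Random(540)          # fixed seed only to make the demo repeatable
    original = "ACGTTGCAACGTACGTGGCA"  # toy stand-in for the real input sequence
    sim = simulate_order1(original, len(original), rng)
    print(format_fasta("simulated_order1_markov", sim), end="")
```

Because the bases are sampled at random, the counts in the output only approximate those of the input, which is the slight difference noted above. The per-base loop in `simulate_order1` is written for clarity rather than speed.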

  3. Use your program in #1 above to generate three simulated FASTA files based on the length and nucleotide or dinucleotide frequency of the 10-megabase mouse genomic region used in homework 1.

    Your output should include the information below for each of the four FASTA files (the original and the three simulated).

  4. All frequencies should be given to 4 decimal places, and you should use the nucleotide order A, C, G, T for the matrix rows and columns.
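
The formatting requirement above can be sketched in Python as follows. This is only an illustration: the exact layout required by the template is not reproduced here, so the column widths and the choice to show a joint dinucleotide frequency matrix (first base in rows, second base in columns) are assumptions, as are the function names. The input is assumed to be a non-empty sequence of uppercase A/C/G/T.

```python
from collections import Counter

ALPHABET = "ACGT"  # row and column order required by the assignment


def nucleotide_frequencies(seq):
    """Return {base: frequency} over A, C, G, T."""
    counts = Counter(seq)
    total = sum(counts[b] for b in ALPHABET)
    return {b: counts[b] / total for b in ALPHABET}


def dinucleotide_frequencies(seq):
    """Return a 4x4 matrix of dinucleotide frequencies (rows: first base)."""
    pairs = Counter(zip(seq, seq[1:]))
    total = sum(pairs[(a, b)] for a in ALPHABET for b in ALPHABET)
    return [[pairs[(a, b)] / total for b in ALPHABET] for a in ALPHABET]


def format_matrix(matrix):
    """Render the matrix with a header row and values to 4 decimal places."""
    lines = ["  " + " ".join(f"{b:>6}" for b in ALPHABET)]
    for a, row in zip(ALPHABET, matrix):
        lines.append(f"{a} " + " ".join(f"{v:.4f}" for v in row))
    return "\n".join(lines)


if __name__ == "__main__":
    seq = "ACGTTGCAACGTACGTGGCA"  # toy stand-in for a real sequence
    for base, freq in nucleotide_frequencies(seq).items():
        print(f"{base}={freq:.4f}")
    print(format_matrix(dinucleotide_frequencies(seq)))
```

The `:.4f` format specifier rounds each value to four decimal places, so the printed entries of a matrix may not sum to exactly 1.0000.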

  5. Run your program from homework 1 three times. In all three runs, sequence 1 should be the 10-megabase human genomic region used in homework 1. In the first run, sequence 2 should be the simulated equal-frequency sequence (from #3 above). In the second run, sequence 2 should be the order-0 Markov model simulated sequence. In the third and final run, sequence 2 should be the order-1 Markov model simulated sequence. Your output for each run should be in the same format as for homework 1 (with FASTA file names added, per the template).
  6. Answer the following questions after the simulated histogram, as shown in the template file:
    1. Given your results, what can you conclude about the statistical significance of matches between the orthologous mouse and human regions in homework 1?
    2. Several interesting genes are located in this region. For example, the CPEB3 gene contains an intron-encoded self-cleaving ribozyme, and Insulin Degrading Enzyme (IDE), an insulin protease, has clinical significance for Alzheimer's disease. The HHEX gene encodes a member of the homeobox family of transcription factors, many of which are involved in developmental processes. Using the longest match sequence(s) you found in Assignment 1, try to figure out what biological feature they correspond to. You can do this by exploring the UCSC browser (hg38 and mm39). As a first step, it may be helpful to BLAST your longest match sequence(s) against the Human and Mouse genomes using this link.
    3. Based on your histogram, how many UCEs are there in this region? Justify your answer. Hint: see this Science paper on "ultraconserved elements" (UCEs; >=200 bp exact match between mouse and human).
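
For the UCE question above, the tally itself is a threshold sum over the match-length histogram. The sketch below assumes the histogram is available as a mapping from match length to number of matches; that data structure, the function name, and the toy numbers are hypothetical and are not taken from the homework 1 output format.

```python
UCE_MIN_LENGTH = 200  # threshold quoted in the assignment: >= 200 bp exact match


def count_long_matches(histogram, min_length=UCE_MIN_LENGTH):
    """Sum the counts of all histogram bins at or above `min_length`."""
    return sum(n for length, n in histogram.items() if length >= min_length)


if __name__ == "__main__":
    # Toy histogram (made-up values): match length -> number of matches.
    toy_histogram = {150: 3, 199: 1, 200: 2, 312: 1}
    print(count_long_matches(toy_histogram))  # bins 200 and 312 qualify: 2 + 1 = 3
```

Whether each counted match corresponds to a distinct element depends on how your homework 1 program tallies matches, so the number still needs to be justified in your answer.
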
  7. You must turn in your results and your computer program (only the program in #1 above -- you don't need to turn in your HW1 program again!), using this file as a template. Please put everything into ONE file - do not send an archive of files or a tar file. After creating a plain text file (NOT a word processing document file) in this format, compress it (using either Unix compress, or gzip -- if you don't have access to either of these programs, let us know), and send it as an attachment to both Phil at phg@uw.edu and Conor at concamp@uw.edu.