Genome 540 Homework Assignment 2

(Winter Quarter 2023)

Due Sunday Jan 22, 11:59pm

Write a program that generates FASTA files for three simulated sequences. This program should:
- read in a file in FASTA format
- determine the nucleotide counts and frequencies, the dinucleotide counts and frequencies, and the length, for the input sequence
- using the equal frequency assumption, produce a simulated sequence with the same length and then output a file in FASTA format containing the simulated sequence
- using an order-0 Markov model, produce a simulated sequence with the same length and nucleotide frequency as the input sequence, and then output a file in FASTA format containing the simulated sequence
- using an order-1 Markov model, produce a simulated sequence with the same length and dinucleotide frequency as the input, and then output a file in FASTA format containing the simulated sequence
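
The three simulation steps above can be sketched as follows. This is a minimal illustration, not a reference solution: the function names (`simulate_equal`, `simulate_order0`, `simulate_order1`) are hypothetical, the input is assumed to contain only uppercase A, C, G, and T, and the fallback used when a nucleotide has no observed successor is an assumption of this sketch.

```python
import random
from collections import Counter

ALPHABET = "ACGT"  # assumption: input contains only these uppercase letters


def simulate_equal(length, rng=None):
    """Draw each position uniformly from A, C, G, T."""
    if rng is None:
        rng = random
    return "".join(rng.choices(ALPHABET, k=length))


def simulate_order0(seq, rng=None):
    """Draw each position independently, weighted by nucleotide counts."""
    if rng is None:
        rng = random
    counts = Counter(seq)
    weights = [counts[b] for b in ALPHABET]
    return "".join(rng.choices(ALPHABET, weights=weights, k=len(seq)))


def simulate_order1(seq, rng=None):
    """Draw each position conditioned on the previous one."""
    if rng is None:
        rng = random
    counts = Counter(seq)
    base_weights = [counts[b] for b in ALPHABET]
    # Overlapping dinucleotide counts: (seq[i], seq[i+1]) for every i.
    pair_counts = Counter(zip(seq, seq[1:]))
    row_weights = {
        a: [pair_counts[(a, b)] for b in ALPHABET] for a in ALPHABET
    }
    # First nucleotide comes from the order-0 distribution.
    prev = rng.choices(ALPHABET, weights=base_weights)[0]
    out = [prev]
    for _ in range(len(seq) - 1):
        weights = row_weights[prev]
        if not any(weights):
            # Assumption: if prev was never followed by anything in the
            # input, fall back to the order-0 weights.
            weights = base_weights
        prev = rng.choices(ALPHABET, weights=weights)[0]
        out.append(prev)
    return "".join(out)
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible. For a 10-megabase input the pure-Python loop in `simulate_order1` is slow but workable; a vectorized approach would be faster.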

The sequence names for your output sequences should indicate that they are simulated (and the model used). Note that the nucleotide and dinucleotide counts in your simulated sequences may differ slightly from those in the input, due to the randomization process.
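
One possible way to handle the FASTA input and output is sketched below. The helper names (`parse_fasta`, `format_fasta`), the 60-character line width, and the example sequence name are illustrative assumptions, as is the choice to uppercase the input and to read only the first record in the file.

```python
def parse_fasta(lines):
    """Return (header, sequence) for the first record in an iterable of lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break  # assumption: only the first record is needed
            header = line[1:]
        else:
            # Assumption: case is not meaningful, so normalize to uppercase.
            chunks.append(line.upper())
    return header, "".join(chunks)


def format_fasta(name, seq, width=60):
    """Return a FASTA record as a string, wrapping the sequence at `width`."""
    body = "\n".join(seq[i:i + width] for i in range(0, len(seq), width))
    return f">{name}\n{body}\n"
```

A name such as `simulated_order1_markov` in the call `format_fasta("simulated_order1_markov", sim_seq)` would mark the output as simulated and identify the model; the exact naming scheme is up to you.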

Use the program described above to generate three simulated FASTA files based on the length and nucleotide or dinucleotide frequency of the 10-megabase mouse genomic region used in homework 1.
Your output should include the information below for each of the four FASTA files (the original and three simulated):
- the name and first line of the fasta file
- the nucleotide count histogram
- the nucleotide frequencies
- the dinucleotide count matrix
- the dinucleotide frequency matrix
- the conditional frequency matrix

All frequencies should be given to 4 decimal places, and the matrix rows and columns should use the nucleotide order A, C, G, T.
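
The matrices listed above could be computed along these lines. This is a sketch under stated assumptions: dinucleotides are counted as overlapping pairs, the row is taken to be the first nucleotide and the column the second, and the conditional frequency is interpreted as the dinucleotide count divided by its row total. Check these conventions against the template; the function names and column widths are hypothetical.

```python
from collections import Counter

ORDER = "ACGT"  # row and column order required by the assignment


def dinucleotide_matrix(seq):
    """4x4 counts of overlapping pairs; row = first base, column = second."""
    pairs = Counter(zip(seq, seq[1:]))
    return [[pairs[(a, b)] for b in ORDER] for a in ORDER]


def frequency_matrix(counts):
    """Each count divided by the total number of dinucleotides."""
    total = sum(sum(row) for row in counts)
    return [[c / total if total else 0.0 for c in row] for row in counts]


def conditional_matrix(counts):
    """Assumed definition: each count divided by its row total."""
    result = []
    for row in counts:
        row_total = sum(row)
        result.append([c / row_total if row_total else 0.0 for c in row])
    return result


def format_matrix(matrix, fmt):
    """Render a 4x4 matrix as text; fmt is e.g. '>10d' or '>10.4f'."""
    lines = ["  " + " ".join(f"{b:>10}" for b in ORDER)]
    for a, row in zip(ORDER, matrix):
        lines.append(f"{a} " + " ".join(format(v, fmt) for v in row))
    return "\n".join(lines)
```

Calling `format_matrix(frequency_matrix(counts), ">10.4f")` prints every frequency with 4 decimal places, while `">10d"` suits the integer count matrix.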

Run your program from homework 1 three times. In all three runs, sequence 1 should be the 10-megabase human genomic region used in homework 1. In the first run, sequence 2 should be the simulated equal-frequency sequence (generated in the step above). In the second run, sequence 2 should be the order-0 Markov model simulated sequence. In the third and final run, sequence 2 should be the order-1 Markov model simulated sequence. Your output for each run should be in the same format as for homework 1 (with FASTA file names added, per the template).
Answer the following questions after the simulated histogram, as shown in the template file:

Given your results, what can you conclude about the statistical significance of matches between the orthologous mouse and human regions in homework 1?
Several interesting genes are located here. For example, the CPEB3 gene contains an intron-encoded self-cleaving ribozyme; Insulin Degrading Enzyme (IDE), an insulin protease, has clinical significance for Alzheimer's disease; and the HHEX gene encodes a member of the homeobox family of transcription factors, many of which are involved in developmental processes. Using the longest match sequence(s) you found in Assignment 1, try to figure out what biological feature they correspond to. You can do this by exploring the UCSC browser (hg38 and mm39). As a first step, it may be helpful to BLAST your longest match sequence(s) against the Human and Mouse genomes using this link.
Based on your histogram, how many UCEs are there in this region? Justify your answer. Hint: see this Science paper on "ultraconserved elements" (UCEs; >=200 bp exact match between mouse and human).

You must turn in your results and your computer program (only the simulation program described above -- you don't need to turn in your HW1 program again!), using this file as a template. Please put everything into ONE file - do not send an archive of files or a tar file. After creating a plain text file (NOT a word processing document file) in this format, compress it (using either Unix compress or gzip; if you don't have access to either of these programs, let us know), and send it as an attachment to both Phil at phg@uw.edu and Conor at concamp@uw.edu.