- Write a program that generates FASTA files for two simulated sequences. This program should:
- read in a file in FASTA format
- determine the nucleotide counts and frequencies, the dinucleotide counts and frequencies, and the length, for the input sequence
- using an order-0 Markov model, produce a simulated sequence with the same length and nucleotide frequency as the input sequence, and then output a file in FASTA format containing the simulated sequence
- using an order-1 Markov model, produce a simulated sequence with the same length and dinucleotide frequency as the input, and then output a file in FASTA format containing the simulated sequence
The sequence names for your output sequences should indicate that they are simulated (and the model used). Note that the nucleotide and dinucleotide counts in your simulated sequences may differ slightly from those in the input, due to the randomization process.
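The two simulations described above can be sketched roughly as follows. This is a minimal illustration, not a reference solution. The function names (`read_fasta`, `simulate_order0`, `simulate_order1`, `write_fasta`) are hypothetical, and the sketch assumes a single-record FASTA file whose sequence has already been reduced to the characters A, C, G, T (how to treat N or lowercase bases is left open here). The order-1 model draws its first nucleotide from the overall nucleotide frequencies, which is one reasonable choice among several.

```python
import random
from collections import Counter

ALPHABET = "ACGT"

def read_fasta(path):
    """Return (header, sequence) for the first record in a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    break  # assumption: only the first record is used
                header = line[1:]
            elif line:
                chunks.append(line.upper())
    return header, "".join(chunks)

def simulate_order0(seq, rng):
    """Draw len(seq) independent nucleotides using the input's nucleotide counts as weights."""
    if not seq:
        return ""
    counts = Counter(seq)
    weights = [counts[b] for b in ALPHABET]
    return "".join(rng.choices(ALPHABET, weights=weights, k=len(seq)))

def simulate_order1(seq, rng):
    """Draw each nucleotide conditioned on the previous one, using dinucleotide counts."""
    if not seq:
        return ""
    mono = Counter(seq)
    pairs = Counter(zip(seq, seq[1:]))
    base_weights = [mono[b] for b in ALPHABET]
    rows = {}
    for prev in ALPHABET:
        row = [pairs[(prev, nxt)] for nxt in ALPHABET]
        # Fall back to the nucleotide counts if `prev` is never followed by anything.
        rows[prev] = row if sum(row) > 0 else base_weights
    out = rng.choices(ALPHABET, weights=base_weights, k=1)
    for _ in range(len(seq) - 1):
        out.append(rng.choices(ALPHABET, weights=rows[out[-1]], k=1)[0])
    return "".join(out)

def write_fasta(path, name, seq, width=60):
    """Write one record, wrapping the sequence at `width` characters per line."""
    with open(path, "w") as handle:
        handle.write(f">{name}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")
```

A typical call would read the input with `read_fasta`, create a generator with `random.Random()`, and pass each simulated sequence to `write_fasta` with a name such as `simulated_order1_markov` (the naming scheme is only an example).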
- Use your program from #1 above to generate two simulated FASTA files based on the length and the nucleotide or dinucleotide frequencies of the 10-megabase mouse genomic region used in homework 1.
Your output should include the information below for each of the three FASTA files (the original and the two simulated files):
- the name and first line of the FASTA file
- the nucleotide count histogram
- the nucleotide frequencies
- the dinucleotide count matrix
- the dinucleotide frequency matrix
- the conditional frequency matrix
All frequencies should be given to 4 decimal places, and you should use the nucleotide order A, C, G, T for the matrix rows and columns.
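As a rough sketch of how these tables might be computed and printed, the example below counts nucleotides and overlapping dinucleotides, converts them to frequencies, and formats a matrix with rows and columns in the order A, C, G, T. The function names are hypothetical. The sketch also assumes that "conditional frequency" means the frequency of the second nucleotide given the first (each dinucleotide count divided by its row total), and that pairs containing any character other than A, C, G, T are skipped. Check both assumptions against the template file.

```python
from collections import Counter

ALPHABET = "ACGT"

def count_tables(seq):
    """Return (nucleotide counts, dinucleotide counts) over overlapping pairs."""
    mono = Counter(c for c in seq if c in ALPHABET)
    di = Counter(
        (a, b) for a, b in zip(seq, seq[1:])
        if a in ALPHABET and b in ALPHABET
    )
    return mono, di

def frequency_tables(mono, di):
    """Return nucleotide, dinucleotide, and conditional frequency tables."""
    n = sum(mono.values())
    m = sum(di.values())
    nuc_freq = {b: (mono[b] / n if n else 0.0) for b in ALPHABET}
    di_freq = {}
    cond = {}
    for a in ALPHABET:
        row_total = sum(di[(a, b)] for b in ALPHABET)
        for b in ALPHABET:
            di_freq[(a, b)] = di[(a, b)] / m if m else 0.0
            # Assumed definition: frequency of b given that the previous base is a.
            cond[(a, b)] = di[(a, b)] / row_total if row_total else 0.0
    return nuc_freq, di_freq, cond

def format_matrix(table, fmt="{:.4f}"):
    """Format a 4x4 table keyed by (row, column) with rows and columns in A, C, G, T order."""
    lines = ["  " + " ".join(f"{b:>10}" for b in ALPHABET)]
    for a in ALPHABET:
        cells = " ".join(f"{fmt.format(table[(a, b)]):>10}" for b in ALPHABET)
        lines.append(f"{a} {cells}")
    return "\n".join(lines)
```

Passing `fmt="{:d}"` to `format_matrix` prints the dinucleotide count matrix with the same layout, while the default format gives frequencies to 4 decimal places.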
- Run your program from homework 1 twice. In both runs, sequence 1 should be the 10-megabase human genomic region used in homework 1. In the first run, sequence 2 should be the order-0 Markov model simulated sequence (from #2 above). In the second run, sequence 2 should be the order-1 Markov model simulated sequence. Your output for each run should be in the same format as for homework 1.
- Answer this question after the simulated histogram, as shown in the template file:
- Given your results, what can you conclude about the statistical significance of matches between the orthologous mouse and human regions in homework 1?
- You must turn in your results and your computer program (only the program in #1 above -- you don't need to turn in your HW1 program again!), using this file as a template. Please put everything into ONE file; do not send an archive of files or a tar file. After creating a plain text file (NOT a word processing document file) in this format, compress it (using either Unix compress or gzip; if you don't have access to either of these programs, let us know), and send it as an attachment to both Phil at phg@uw.edu and CX at cxqiu@uw.edu.