Genome 540 Homework Assignment 2

Due Sunday Jan 24, 11:59pm

Write a program that generates a FASTA file of a simulated genome. This program should:
- read in a file in FASTA format (i.e. having a header line which starts with the character ">" followed immediately by the sequence name, with the sequence itself following on subsequent lines)
- determine the frequencies of the nucleotides (based on the forward and reverse strands) and the length of the sequence
- using an order-0 Markov model, produce a simulated sequence with the same length and nucleotide composition (the counts will be slightly different due to the randomization process)
- output a file in FASTA format containing the simulated sequence (with a sequence name indicating that it is simulated!)
Modify your program from homework 1 to find, for each suffix in the 'forward' strand of genome 1, the length of the longest matching subsequence in genome 2 (or its reverse complement), and to report a histogram of these lengths. Your program should:
- read two input files in FASTA format, each of which contains one of the genomes to be compared. As a data check, your program should count the number of bases of each type (i.e. A, C, T, G, or N) in each sequence and give their frequencies (to 4 decimal places, based on the forward and reverse strands).
- use the described algorithm to find the longest matching subsequence in genome 2 for each suffix in genome 1. When you are looking for these in this assignment, you should consider both strands of genome 2 (i.e. both the sequence given in the FASTA file, which is sometimes called the 'forward' strand, and its reverse complement, which is sometimes called the 'reverse' strand), but you only need to consider the forward strand of genome 1. The easiest way to do this is to create and store in memory a single sequence of length N + 2M + 3 (where the two input sequences have lengths N and M bases respectively) constructed by concatenating together genome 1's forward strand and genome 2's forward and reverse strands (each terminated by a null character to terminate the strings, so that the lexicographic sort does not read across the sequence boundaries), and then create a single list of pointers to each position in this merged sequence, which you then sort. For each suffix in the forward strand of genome 1, find the longest matching subsequence in the forward or reverse strand of genome 2.
- Construct a histogram which indicates, for each length, the number of genome 1 suffices having that match length to genome 2. The histogram template can be seen in the template file below. If you sum the counts over all lengths, the total should equal the length of genome 1. When reporting the histogram, eliminate lengths that have counts of 0.
Use your program in #1 above to generate a simulated FASTA based on the length and nucleotide frequency of Advenella kashmirensis.
Run your program in #2 above twice. In both runs, genome 1 should be Bordetella bronchiseptica. In the first run, genome 2 should be Advenella kashmirensis. In the second run, genome 2 should be the simulated genome from #3 above.
Your output for each run of your program in #4 above should include:
- the name and first line of the .fna file
- the nucleotide count histogram
- the background nucleotide frequency (to 4 decimal places, based on the forward and reverse strands)
- the match length histogram (formatted as {length} {match count})
Answer this question after the simulated histogram, as shown in the template file:
- Given your results, what can you conclude about the statistical significance of matches between the Bordetella bronchiseptica and Advenella kashmirensis genomes?
You must turn in your results and your computer programs, using this template file. Please put everything into ONE plain text file - do not send an archive of files or a tar file, or a word processing document file. Compress it (using either Unix compress, or gzip -- if you don't have access to either of these programs let us know), and send it as an attachment to both Phil (phg (at) uw.edu) and Dani (dfaivre (at) uw.edu).