Genome 540 Homework Assignment 8

Due Sun Mar 8

Write a program to predict the expression of a gene given measurements about the activity at its promoter.
In order to avoid making you process the (very large) genomics assays files, we have extracted the data at each transcription start site for you. Here's what we did:
- Downloaded transcription start site (TSS) and CAGE data on 33 ENCODE cell lines from Gencode (version 10). Transformed CAGE values with the inverse hyperbolic sine transform (similar to a log transform) to damp down the effect of large outliers. Removed TSSs with a sum of CAGE values across cell lines of less than 5, in order to get rid of bad annotations. Called a TSS "expressed" if its GM12878 CAGE value is greater than the median across the TSSs. GM12878 is a human lymphoblastoid cell line.
- Downloaded data on 10 genomics assays in GM12878 from ENCODE, represented in bigwig format, which specifies a value at each (mappable) position in the genome. Extracted the values for each assay in the region 100 bp up- and down-stream of each TSS. Filled in unmappable positions with 0, transformed these values with the inverse hyperbolic sine transform, and averaged over the window to produce a single value for each assay at each TSS.
- Randomly split TSSs into a training set (70%) and testing set (30%).
Again, this is just for your knowledge -- you don't need to do any of the steps above.
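For intuition only, here is a minimal sketch of the transform and window averaging described above. The function name `window_feature` and the use of `None` to mark unmappable positions are assumptions made for illustration; they are not the code we actually ran.

```python
import math

def window_feature(window_values):
    """Average the asinh-transformed values over one window.

    `None` marks an unmappable position (an assumed representation)
    and is filled in with 0 before transforming.
    """
    filled = [0.0 if v is None else v for v in window_values]
    return sum(math.asinh(v) for v in filled) / len(filled)

# asinh(x) = ln(x + sqrt(x^2 + 1)): it is 0 at x = 0 (where log is
# undefined) and approaches ln(2x) for large x.
raw = [0.0, 1.0, 10.0, 1000.0]
transformed = [math.asinh(v) for v in raw]
for r, t in zip(raw, transformed):
    print(f"{r:8.1f} -> {t:.3f}")

# One hypothetical window with an unmappable position in the middle.
feature = window_feature([3.0, None, 12.0])
print(f"window feature: {feature:.3f}")
```

Unlike a plain log transform, this needs no pseudocount to handle zeros, which is convenient when unmappable positions are filled in with 0.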
The input files are here: training set and testing set. The first four columns specify the location and strand of the TSS in question. The fifth column, "expressed", specifies whether or not this TSS is expressed according to our definition. You should use this column as the label. The remaining columns specify 11 features for your classifier to use (10 genomics assays plus an intercept feature). The intercept feature is 1 for all TSSs. The parameter associated with this feature determines the predicted probability of expression for a TSS with zeros in all the other features. Without such a feature, the model would always give such TSSs a 50% probability of being expressed.
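The role of the intercept can be checked with a small sketch. The weights below are made up for illustration and have nothing to do with the values your classifier will learn.

```python
import math

def sigmoid(z):
    """Logistic function mapping a real-valued score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(features, weights):
    """Predicted probability of expression for one TSS."""
    z = sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical model with two assay features and no intercept:
# an all-zero TSS gets score 0, so the probability is exactly 0.5.
p_without = predict([0.0, 0.0], [0.8, -0.3])

# Same model plus an intercept feature fixed at 1 (last position).
# The intercept weight alone now sets the probability for an all-zero TSS.
p_with = predict([0.0, 0.0, 1.0], [0.8, -0.3, -2.0])

print(f"without intercept: {p_without:.3f}")
print(f"with intercept:    {p_with:.3f}")
```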
Implement a logistic regression classifier to run on this data set. As described in class, your classifier will have one parameter for each feature. Train your classifier on the training set, using the gradient descent algorithm described in class to optimize the parameters on the training data. Initialize all parameters to zero. Use a learning rate (alpha in the lecture notes) of 1e-5. Stop training when the difference in log likelihood between iterations is less than 1.0.
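As a rough illustration of the training loop, the sketch below runs full-batch gradient updates on the log likelihood for a tiny made-up data set. It assumes the update theta_j += alpha * sum_i (y_i - p_i) * x_ij; the lecture notes remain the authority on the exact form. The names `train`, `log_likelihood`, `toy_X`, and `toy_y`, the `max_iter` safety cap, and the toy settings for `alpha` and `tol` are all assumptions for this example. For the assignment itself, use the values specified above.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function (avoids overflow in exp).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def log_likelihood(X, y, theta):
    """Base-e log likelihood of 0/1 labels y under parameters theta."""
    total = 0.0
    for xi, yi in zip(X, y):
        z = sum(t * x for t, x in zip(theta, xi))
        # log p(y | x) = y*z - log(1 + e^z), computed stably.
        total += yi * z - (max(z, 0.0) + math.log1p(math.exp(-abs(z))))
    return total

def train(X, y, alpha=1e-5, tol=1.0, max_iter=100000):
    theta = [0.0] * len(X[0])          # all parameters start at zero
    prev_ll = log_likelihood(X, y, theta)
    ll = prev_ll
    for _ in range(max_iter):          # max_iter is only a safety cap
        grad = [0.0] * len(theta)
        for xi, yi in zip(X, y):
            err = yi - sigmoid(sum(t * x for t, x in zip(theta, xi)))
            for j, x in enumerate(xi):
                grad[j] += err * x
        theta = [t + alpha * g for t, g in zip(theta, grad)]
        ll = log_likelihood(X, y, theta)
        if abs(ll - prev_ll) < tol:    # stop when the change is small
            break
        prev_ll = ll
    return theta, ll

# Toy data: one feature plus an intercept column of ones.
toy_X = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]
toy_y = [0, 1, 0, 1]
# A larger alpha and tighter tol than the assignment's, chosen only
# because this data set has four examples.
theta, ll = train(toy_X, toy_y, alpha=0.1, tol=1e-9)
print("parameters:", [f"{t:.3f}" for t in theta])
print(f"log likelihood: {ll:.3f}")
```

With thousands of TSSs the per-example Python loops will be slow; vectorizing the same arithmetic is a reasonable design choice, but the logic stays the same.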
Your output should provide:

The final parameters of the model (three decimal places).
The final log likelihood on the training and test sets (base e, three decimal places).
Classify examples with greater than 50% probability as "expressed". Report your accuracy (fraction of TSSs correctly classified) on both the training and test sets (three decimal places).
Explain why you think the two most positively weighted and the two most negatively weighted features were chosen by the model as such. You might find this table of histone modification associations helpful. If you find that the intercept feature receives a large-magnitude weight, what does that say about TSSs with low values for all 10 genomics assays?
The approximate number of hours you spent on this homework (so we can calibrate the difficulty for future years).
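The accuracy item in the list above can be sketched as follows. The parameters and the three examples are invented for illustration; `accuracy` and `predict_proba` are hypothetical helper names, not part of the template.

```python
import math

def predict_proba(xi, theta):
    """Predicted probability of "expressed" for one feature vector."""
    z = sum(t * x for t, x in zip(theta, xi))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def accuracy(X, y, theta, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    correct = 0
    for xi, yi in zip(X, y):
        predicted = 1 if predict_proba(xi, theta) > threshold else 0
        if predicted == yi:
            correct += 1
    return correct / len(y)

# Made-up parameters: one assay weight and an intercept weight.
theta = [1.2, -0.5]
X = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [0, 1, 0]

acc = accuracy(X, y, theta)
print(f"accuracy: {acc:.3f}")  # formatted to three decimal places
```

Note the strict inequality: an example with probability exactly 0.5 is not classified as "expressed" under the rule stated above.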

You must turn in your results and your computer program, using this template file. Please note that the template file references the wrong chromosome start location; don't let this confuse you! Please put everything into ONE plain text file - do not send an archive of files or a tar file, or a word processing document file. Compress it (using either Unix compress, or gzip -- if you don't have access to either of these programs let us know), and send it as an attachment to both Phil (phg (at) u.washington.edu) and Max (maxwl (at) cs.washington.edu). Name your file "[your first initial][your last name]_hw8.txt.[compression extension (e.g. .gz)]".