Dream challenge 2016

This project is dedicated to our participation in the ENCODE-DREAM challenge 2016. The aim of the challenge is to predict genome-wide transcription factor binding events (measured by ChIP-seq) within and across cell types. To address this task, we use deep neural networks, which have recently proven to be very powerful.

Set the environment variable for the data folder

Set the DREAM_DATA environment variable to the directory where the raw and preprocessed dataset should be located:

export DREAM_DATA=/encode/data/dir
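
The download, preprocessing, and training scripts are expected to read this variable at runtime. A minimal sketch of such a lookup (the helper name is illustrative and not taken from the repository):

import os

def dream_data_dir():
    """Return the dataset root directory configured via DREAM_DATA."""
    path = os.environ.get("DREAM_DATA")
    if path is None:
        raise RuntimeError("DREAM_DATA is not set, e.g. export DREAM_DATA=/encode/data/dir")
    return path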

Software requirements

The BlueWhale model was implemented in Python 2.7 using the following libraries:

  • joblib (>=0.9.4)
  • numpy (>=1.10.4)
  • theano (>=0.9.0)
  • lasagne (>=0.2.dev1)
  • biopython (1.67) for the preprocessing only
  • rpy2
  • matplotlib
  • pandas
  • cuda-7.5
  • bedtools (for preprocessing)
  • bigWigAverageOverBed (from UCSC for preprocessing)

Detailed library installation instructions

Step-by-step instructions for installing the required Python packages. Note that Lasagne is installed from its master branch, so the exact version depends on when the installation is performed. At some point this should be pinned so that everyone works with the same version of the library.

$ wget https://bootstrap.pypa.io/get-pip.py
$ python get-pip.py --user
$ pip install --user -r https://raw.githubusercontent.com/Lasagne/Lasagne/master/requirements.txt
$ pip install --user https://github.com/Lasagne/Lasagne/archive/master.zip
$ pip install --user joblib
$ pip install --user rpy2
$ pip install --user matplotlib
$ pip install --user pandas
# Optionally upgrade Theano to a specific version (this isn't necessary right now)
#$ pip install --user --upgrade git+git://github.com/Theano/Theano.git@90c5034

Prerequisites for GPU usage

While Theano also facilitates processing on the CPU, the sheer size of the dataset and the model makes massively parallel computation on GPUs indispensable. We therefore recommend setting up Theano for GPU usage according to its installation instructions.
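
For reference, here is a minimal sketch of enabling the GPU backend from within Python by setting THEANO_FLAGS before Theano is imported; the flag values follow the Theano documentation and may need to be adjusted to your setup:

import os

# Select the GPU backend and 32-bit floats before Theano is imported;
# an already exported THEANO_FLAGS variable takes precedence.
os.environ.setdefault("THEANO_FLAGS", "device=gpu,floatX=float32")

import theano
print(theano.config.device)  # should report the GPU device if the setup worked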

Preprocessing

To download and preprocess the dataset, invoke the following scripts:

python 01_download_data.py
bash 02_preprocess_data.bash

These scripts produce multiple *.pkl files that contain the dataset for training and prediction in numpy format (a short loading example follows the directory tree below). After the preprocessing step, the directory structure should look like:

${DREAM_DATA}
|------------annotations
|            |-----------tf_names.txt   # contains the TF-names
|            |-----------cells.txt      # contains the cell-types
|------------ChIPseq
|------------essential_training_data
|------------RNAseq
|------------ladder_input
|            |-----------dhs-sum
|                        |------dhs.pkl # contains Dnase-fold-enrichment
|            |-----------dna
|                        |------dna.pkl # contains one-hot-DNA-representation
|------------test_input
|            |-----------dhs-sum
|                        |------dhs.pkl
|            |-----------dna
|                        |------dna.pkl
|------------train_input
             |-----------chip
                         |------chip.pkl # contains binary-valued TF-binding events
             |-----------dhs-sum
                         |------dhs.pkl
             |-----------dna
                         |------dna.pkl
             |-----------rna
                         |------rna.pkl   # contains gene-expression levels
                         |------tfrna.pkl # contains TF-expression levels
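
To sanity-check the preprocessing output, one of the resulting files can be loaded and inspected. A minimal sketch, assuming the *.pkl files were written with joblib (which is listed among the requirements) and contain numpy arrays:

import os
import joblib

# Load the one-hot DNA representation of the training input and report its shape.
dna = joblib.load(os.path.join(os.environ["DREAM_DATA"], "train_input", "dna", "dna.pkl"))
print(type(dna), getattr(dna, "shape", None))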

Model

The model is implemented in Python using Theano and Lasagne.

The model is modular, meaning that it is made up of separate neural networks that are each trained on different input data sets (e.g. DNA sequence, DHS, or RNA-seq data). Subsequently, we combine the models by removing the output-prediction layer of each network and using the layer underneath as input for another network. In the combined network, we freeze the weights of the layers below and only train the newly added layers on top. This reduces overfitting and additionally speeds up learning. This greedy layer-wise learning approach was inspired by layer-wise pre-training with restricted Boltzmann machines; however, instead of pre-training the layers in an unsupervised fashion, our pre-training is supervised.
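
The following is a minimal Lasagne sketch of this combination step: the output layer of each pre-trained sub-network is discarded, the layers underneath are frozen by removing their 'trainable' tag, and a new network is trained on top of the concatenated feature activities. Layer shapes and sizes are illustrative and not taken from the repository.

import lasagne
from lasagne.layers import InputLayer, DenseLayer, ConcatLayer

def freeze(layer):
    """Remove the 'trainable' tag from all parameters below the given layer."""
    for l in lasagne.layers.get_all_layers(layer):
        for param in l.params:
            l.params[param].discard('trainable')
    return layer

# Topmost hidden layers of two pre-trained sub-networks (illustrative shapes);
# in the real model these would come from the DNA and DHS networks with their
# output-prediction layers removed.
dna_in = InputLayer(shape=(None, 4 * 200))
dna_top = DenseLayer(dna_in, num_units=100)

dhs_in = InputLayer(shape=(None, 9))
dhs_top = DenseLayer(dhs_in, num_units=50)

# Freeze the pre-trained parts and train only the newly added layers on top.
merged = ConcatLayer([freeze(dna_top), freeze(dhs_top)])
hidden = DenseLayer(merged, num_units=100)
output = DenseLayer(hidden, num_units=1,
                    nonlinearity=lasagne.nonlinearities.sigmoid)

# Only the parameters of the newly added layers carry the 'trainable' tag.
params = lasagne.layers.get_all_params(output, trainable=True)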

Cross-celltype model

Training

To train the model, change to the source root directory (the directory this README.md file is in) and run:

# to train the predictor based on the DNA sequence only
python bluewhale.py DNA train training 30 
# to train the predictor based on the Dnase-fold-enrichment profile 
# across cell-types and the genomic profile within each cell-type using 9 times 200bp bins
python bluewhale.py DHSSUM train training 40 
# to train a predictor based on the current Dnase-fold-enrichment profile for the given cell-type
python bluewhale.py DHSSUMOOP train training 100 
# to train a network on top of the DNA and the DHSSUM topmost feature activities
python bluewhale.py FULLAGG train training 35 
# to train a network on top of the FULLAGG and DHSSUMOOP topmost
# feature activities as well as a newly learned neural network part
# that is based on the gene expression of the 32 TFs
#
# this model is used for the final submission for all TFs and all cell-types, except for F.JUND.liver
python bluewhale.py FULLRNA train training 20 
# to fine-tune all parameters across all layers simultaneously
# from this model, only F.JUND.liver was used for the final submissions
python bluewhale.py FULLFINE train training 6

Note that the results can differ slightly between systems. Nevertheless, they should be quite similar to the ones shown above. Also, increasing the number of epochs further improves the performance. However, as training is a time-consuming process, we stopped it at the respective numbers of epochs for the above examples.

Prediction

The model predictions for the leaderboard and the final test set were generated using:

# Leaderboard predictions
python bluewhale.py FULLFINE ladder prediction
python bluewhale.py FULLRNA ladder prediction

# Final test set predictions
python bluewhale.py FULLFINE test prediction
python bluewhale.py FULLRNA test prediction

Within-celltype model

Training

To train the within cell-type model, change to the source root directory (the directory this README.md file is in) and run:

# to train the predictor based on the DNA sequence only
python bluewhale.py DNAWITHIN train training 30 
# to train the predictor based on the Dnase-fold-enrichment profile 
# across cell-types and the genomic profile within each cell-type using 9 times 200bp bins
python bluewhale.py DHSSUMWITHIN train training 40 
# to train a network on top of the DNA and the DHSSUM topmost feature activities
python bluewhale.py FULLWITHIN train training 35 

Prediction

To produce predictions from the within cell-type model, run:

python bluewhale.py FULLWITHIN ladder prediction

About

This repository contains the BlueWhale deep neural network that was used in the ENCODE-DREAM challenge 2016.
